#bigquery


We (BBC) have a telemetry system to measure media streaming performance, and a while back we ran an experiment to slurp off a copy of those metrics (rebuffering, media type, bit rates etc.) into my log pipeline, which flows into BigQuery.
Yesterday I cranked up the sample rate to BigQuery from 25% to 100% so we have a resilient system in case of incidents for Glastonbury, Wimbledon & the Euros (plus BAU content).
It's definitely bumped the log rate, maybe an extra billion lines today.
#BBC #BigQuery
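
Not the BBC's actual code, just a minimal sketch of the idea: a configurable sample rate decides whether a given metrics payload gets copied into the log pipeline, so moving from 25% to 100% is a config change rather than a code change (all names here are made up).

```typescript
// Hypothetical sketch of a sample-rate gate (not BBC code; names invented).
// LOG_SAMPLE_RATE=0.25 copies ~25% of payloads; 1.0 copies everything.
const SAMPLE_RATE = Number(process.env.LOG_SAMPLE_RATE ?? "0.25");

interface PlaybackMetrics {
  rebufferMs: number;
  mediaType: string;
  bitrateKbps: number;
}

async function maybeCopyToLogPipeline(
  metrics: PlaybackMetrics,
  forward: (m: PlaybackMetrics) => Promise<void>, // the hop into the BigQuery-bound pipeline
): Promise<void> {
  // Per-event sampling: at a rate of 1.0 every event flows through.
  if (Math.random() < SAMPLE_RATE) {
    await forward(metrics);
  }
}
```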

How messy is #terraform with #gcp?

I'm trying to build a system where a #rust worker ingests HTTP request data via #cloudrun and writes it into #bigtable, which then uses a further ingestion recipe to export the data into #bigquery.

I have tried to write a complete Terraform declaration for this, but ran into permission issues. Then I tried a setup that generates all the artefacts (service accounts, the Docker image) first and refers to those from the Terraform builds, but I hardly see the value of doing it that way.

Does anyone have an example of #cloudrun #cdc? I am new to this and I feel really slow.
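
The worker in the post is Rust, but just to show the shape of the Cloud Run → Bigtable hop, here is a rough sketch using the Node.js client (@google-cloud/bigtable); instance, table and column-family names are placeholders and the error handling is minimal.

```typescript
// Sketch only: a Cloud Run HTTP handler writing each request into Bigtable.
// The post's worker is Rust; this uses the Node.js client purely to illustrate
// the Cloud Run -> Bigtable hop. Instance/table/column-family names are placeholders.
import express from "express";
import {Bigtable} from "@google-cloud/bigtable";

const app = express();
app.use(express.json());

const bigtable = new Bigtable(); // authenticates as the Cloud Run service account
const table = bigtable.instance("ingest-instance").table("http_requests");

app.post("/ingest", async (req, res) => {
  try {
    const rowKey = `${req.path}#${Date.now()}`; // naive key; real keys should spread write load
    await table.insert([
      {
        key: rowKey,
        data: {
          // "req" is a placeholder column family that must already exist on the table
          req: {
            method: req.method,
            path: req.path,
            body: JSON.stringify(req.body ?? {}),
          },
        },
      },
    ]);
    res.status(204).end();
  } catch (err) {
    console.error("bigtable insert failed", err);
    res.status(500).send("write failed");
  }
});

app.listen(Number(process.env.PORT) || 8080);
```

If the permission problems are on the worker side rather than the export side, the Cloud Run runtime service account generally needs roles/bigtable.user on the instance, and that IAM binding can live in the same Terraform configuration as the service itself.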

GA4 intraday exports and cookieless pings

I build a lot of reports for clients that use the GA4 Big Query export as their source.

Now.. that works like a charm. But.. you will need to wait some time to get processed data from the events_ tables.

More recent data will appear in the streaming _intraday_ tables, if you have that enabled. But.. that data is not always complete! Especially when your site has consent mode enabled, and does not set a cookie until after consent.
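
For reference, a common way to query fresh and processed data together is to UNION the daily and intraday tables. A sketch with placeholder project and dataset IDs (note that the events_* wildcard also matches the intraday tables, hence the suffix filter):

```typescript
// Sketch: combine processed daily tables with the streaming intraday tables.
// Replace my-project / analytics_123456789 with your own export dataset.
import {BigQuery} from "@google-cloud/bigquery";

const bq = new BigQuery();

const query = `
  SELECT event_date, event_name, user_pseudo_id
  FROM \`my-project.analytics_123456789.events_*\`
  WHERE _TABLE_SUFFIX NOT LIKE 'intraday_%'  -- processed daily tables only
  UNION ALL
  SELECT event_date, event_name, user_pseudo_id
  FROM \`my-project.analytics_123456789.events_intraday_*\`
`;

async function main() {
  const [rows] = await bq.query({query});
  console.log(`fetched ${rows.length} events (processed + intraday)`);
}

main().catch(console.error);
```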

Here’s how it works:

The scenario

Someone visits the site for the first time (source: some campaign), gets confronted with the cookie banner, and then clicks accept.

We tagged the site correctly, so this is what happens:

  1. a page_view event fires (with URL parameters) – and the tracker notices analytics consent is denied (the default)
  2. the tracker attaches some parameters to this hit, to help processing
    • a session is started
    • this is the first visit
  3. there is an item list on the page: a view_item_list event is triggered
  4. the cookie banner pops up (event: cookiebar_view)
  5. the visitor clicks accept (event: cookiebar_accept) and the tracker gets sent a granted signal
  6. now the cookie can be used, and it is attached to the automatic user_engagement event
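
In gtag terms, the consent side of that flow looks roughly like the sketch below: analytics_storage defaults to denied and is flipped to granted when the visitor accepts. This is a generic Consent Mode sketch, not the site's actual tag configuration.

```typescript
// Generic Google Consent Mode sketch (not the site's actual tag setup).
declare function gtag(...args: unknown[]): void;

// Default: analytics consent denied until the visitor decides.
gtag("consent", "default", {
  analytics_storage: "denied",
});

// Called when the visitor clicks "accept" on the cookie banner.
function onCookiebarAccept(): void {
  // The tracker now receives the "granted" signal; subsequent hits
  // (like the automatic user_engagement event) carry the cookie.
  gtag("consent", "update", {
    analytics_storage: "granted",
  });
}
```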

Sounds simple. Now, let’s see what is streamed into Big Query:

The streaming data gap

Basically, the intraday tables store what happens, as it happens.

  • cookie field (user_pseudo_id) is filled in on hits on/after consent
  • cookie field is NULL for hits before consent

As it should be, right? But there’s a third bullet:

  • first batch of events will not appear in the intraday table!

Here’s what we see (most recent hit first, read from bottom to top):

  1. the page_view is missing in the streaming table
  2. the collected_traffic_source information is missing (it is only ever filled in on the first batch of events)
  3. As a byproduct, we also do not see the session_start and first_visit events
  4. the other events are all sent without a cookie
  5. after consent, we see the user_pseudo_id – finally
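
The gap is easy to confirm with a quick query against today's intraday table (project and dataset IDs below are placeholders): pre-consent events show up with a NULL user_pseudo_id, and the first batch simply isn't there.

```typescript
// Sketch: inspect today's intraday table for pre-consent hits.
// Placeholders: my-project / analytics_123456789.
import {BigQuery} from "@google-cloud/bigquery";

const bq = new BigQuery();

const query = `
  SELECT
    event_name,
    COUNT(*) AS events,
    COUNTIF(user_pseudo_id IS NULL) AS events_without_cookie,
    COUNTIF(collected_traffic_source.manual_source IS NOT NULL) AS events_with_source
  FROM \`my-project.analytics_123456789.events_intraday_*\`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', CURRENT_DATE())
  GROUP BY event_name
  ORDER BY events_without_cookie DESC
`;

async function main() {
  const [rows] = await bq.query({query});
  console.table(rows);
}

main().catch(console.error);
```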

The next day.. Google has glued it all together

Processed data: every event has a row

The following is in the processed data (most recent hit first, read from bottom to top):

  • The page_view event and all other events leading up to the consent have a cookie attached to them! Google rescued that information
  • the parameters attached to the hit (step 2 above) expand into two extra rows:
    • session_start
    • first_visit
  • we have source information: collected_traffic_source is present – on the first batch, as normal

Not visible in the screenshot: session_traffic_source_last_click – the session information is properly filled in.
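
Running the same kind of check against the previous day's processed table shows the repair: the session_start and first_visit rows exist, and the pre-consent events now carry the cookie (placeholder project and dataset IDs again).

```typescript
// Sketch: the same check against yesterday's processed daily table.
import {BigQuery} from "@google-cloud/bigquery";

const bq = new BigQuery();

const query = `
  SELECT
    event_name,
    COUNT(*) AS events,
    COUNTIF(user_pseudo_id IS NULL) AS events_without_cookie
  FROM \`my-project.analytics_123456789.events_*\`
  WHERE _TABLE_SUFFIX = FORMAT_DATE('%Y%m%d', DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
    AND event_name IN ('page_view', 'session_start', 'first_visit', 'user_engagement')
  GROUP BY event_name
`;

async function main() {
  const [rows] = await bq.query({query});
  console.table(rows); // events_without_cookie should be (close to) zero here
}

main().catch(console.error);
```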

The consequences

If you decide to use intraday tables in your Big Query reports, be aware that although the information is fresh (no pun intended, GA360 users), it's incomplete:

  • intraday misses crucial events, namely the first batch (most often a page_view)
    • bye bye landing_page reports based on page_views
    • bye bye traffic source reports based on session_traffic_source_last_click or collected_traffic_source
  • intraday misses cookies on some events
    • which is not too much of an issue, really

Your experiences?

Do you use intraday tables in your models? Have you found clever workarounds to get the correct data in?

Let me know! Drop a comment here, or send me a bluesky message!

Still here?

Check out GA4Dataform – a product I’ve helped build that turns the GA4 Big Query exports into usable tables!

Related posts:

  • Google Analytics 4 truncates page location
  • Making sense of Event Parameters in GA4
  • Make your GA4 life easier: Some powertips!
  • Smart incremental GA4 tables in Dataform

🚀 DataTalksClub's Data Engineering Zoomcamp Week 3 - BigQuery as a data warehousing solution.

🎯 For this week's module, we used Google's BigQuery to read Parquet files from a GCS bucket and compared querying on regular, external, and partitioned/clustered tables.

🔗 My answers to this module: github.com/goosethedev/de-zoom
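
For anyone following along, the week 3 table types boil down to DDL along these lines; bucket, dataset, table and column names are placeholders rather than the actual Zoomcamp resources.

```typescript
// Sketch: external table over Parquet in GCS, then a partitioned/clustered copy.
// Bucket, dataset, table and column names are placeholders.
import {BigQuery} from "@google-cloud/bigquery";

const bq = new BigQuery();

const ddl = `
  -- External table: queries read the Parquet files straight from the bucket.
  CREATE OR REPLACE EXTERNAL TABLE \`my-project.zoomcamp.trips_external\`
  OPTIONS (
    format = 'PARQUET',
    uris = ['gs://my-zoomcamp-bucket/trips/*.parquet']
  );

  -- Native table, partitioned by date and clustered for cheaper, faster filters.
  CREATE OR REPLACE TABLE \`my-project.zoomcamp.trips_partitioned\`
  PARTITION BY DATE(pickup_datetime)
  CLUSTER BY vendor_id AS
  SELECT * FROM \`my-project.zoomcamp.trips_external\`;
`;

async function main() {
  // Multi-statement script: both DDL statements run in one job.
  await bq.query({query: ddl});
}

main().catch(console.error);
```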

FFS. Turns out (after I built a feature) that you can't supply a schema for BigQuery Materialised Views.

> Error: googleapi: Error 400: Schema field shouldn't be used as input with a materialized view, invalid

So it's impossible to have column descriptions for MVs? That sucks.
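
The only description that seems to be supported is at the view level, via OPTIONS in the DDL – which doesn't give per-column descriptions. A hedged sketch, assuming description is accepted in the materialized-view option list (all names are placeholders):

```typescript
// Hedged sketch: a *view-level* description set via OPTIONS in the DDL.
// Assumption: description is accepted in the materialized-view option list.
// This does not provide per-column descriptions (that's what the API rejects).
// All project/dataset/table/column names are placeholders.
import {BigQuery} from "@google-cloud/bigquery";

const bq = new BigQuery();

const ddl = `
  CREATE MATERIALIZED VIEW IF NOT EXISTS \`my-project.logs.mv_daily_counts\`
  OPTIONS (description = 'Daily row counts per source (view-level description only)')
  AS
  SELECT source, DATE(ts) AS event_day, COUNT(*) AS row_count
  FROM \`my-project.logs.raw_events\`
  GROUP BY source, event_day
`;

async function main() {
  await bq.query({query: ddl});
}

main().catch(console.error);
```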

Whilst migrating our log pipeline to use the BigQuery Storage API & thus end-to-end streaming of data from Storage (GCS) via Eventarc & Cloud Run (read, transform, enrich - NodeJS) to BigQuery, I tested some big files, many times the largest we've ever seen in the wild.

It runs at just over 3 log lines/rows per millisecond end-to-end (i.e. including writing to BigQuery), roughly 3,000 rows per second, measured over 3.2M log lines.

Would be interested to know how that compares with similar systems.
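
For comparison, the read/transform half of such a pipeline tends to look something like the sketch below: the GCS object is streamed line by line, each line is enriched, and batches are handed to whatever BigQuery append path is in use (left abstract here, since the post uses the Storage Write API). All names are placeholders.

```typescript
// Sketch of the read/transform half: stream a GCS object line by line,
// enrich each line, and hand batches to whatever BigQuery append path you use.
// Names are placeholders; the actual write step (Storage Write API) is omitted.
import {Storage} from "@google-cloud/storage";
import * as readline from "node:readline";

const storage = new Storage();

async function processLogObject(
  bucket: string,
  objectName: string,
  writeBatch: (rows: object[]) => Promise<void>, // e.g. a Storage Write API append
): Promise<number> {
  const stream = storage.bucket(bucket).file(objectName).createReadStream();
  const lines = readline.createInterface({input: stream, crlfDelay: Infinity});

  const batch: object[] = [];
  let total = 0;

  for await (const line of lines) {
    if (!line) continue;
    // "Transform, enrich": parse and add whatever derived fields are needed.
    batch.push({raw: line, ingested_at: new Date().toISOString(), source: objectName});
    total += 1;

    if (batch.length >= 5000) {
      await writeBatch(batch.splice(0, batch.length));
    }
  }
  if (batch.length > 0) await writeBatch(batch);
  return total;
}
```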

After several iterations, I think I've finally got my log ingest pipeline working properly, at scale, using the #BigQuery Storage API.
Some complications with migrating from the "legacy" "streaming" API (it's not streaming in the code sense) have been really hard to deal with, e.g.:
* A single failed row in a write means the entire write fails
* SQL column defaults don't apply unless you specifically configure them to
* 10MB/write limit
I rewrote the whole thing today & finally things are looking good! 🤞
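
On the 10MB/write limit specifically, a size-capped batching helper usually does the job. A rough sketch – the byte estimate from serialized JSON is approximate, and append stands in for whatever Storage Write API wrapper is in use:

```typescript
// Sketch: split rows into appends that stay under the ~10MB per-request limit.
// The size estimate (serialized JSON length) is rough, hence the headroom.
// "append" stands in for your Storage Write API call.
const MAX_APPEND_BYTES = 9 * 1024 * 1024; // leave headroom under the 10MB limit

export async function appendInChunks(
  rows: object[],
  append: (chunk: object[]) => Promise<void>,
): Promise<void> {
  let chunk: object[] = [];
  let chunkBytes = 0;

  for (const row of rows) {
    const rowBytes = Buffer.byteLength(JSON.stringify(row), "utf8");
    if (chunk.length > 0 && chunkBytes + rowBytes > MAX_APPEND_BYTES) {
      await append(chunk);
      chunk = [];
      chunkBytes = 0;
    }
    chunk.push(row);
    chunkBytes += rowBytes;
  }
  if (chunk.length > 0) {
    await append(chunk);
  }
}
```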