While exploring the infrastructure for a pipeline my team was building, we discovered a query that could have cost us nearly $1 million USD a month in BigQuery. Below, we detail how we reduced this cost and share our tips for lowering costs in BigQuery.
Processing One Billion Rows of Data
My team was responsible for building a data pipeline for a new marketing tool we were shipping to Shopify merchants. We built our pipeline with Apache Flink and launched the tool in an early release to a select group of merchants. Fun fact: this pipeline became one of the first productionized Flink pipelines at Shopify. During the early release, our pipeline ingested one billion rows of data into our Flink pipeline's internal state (managed by RocksDB), and handled streaming requests from Apache Kafka.
We wanted to take the next step by making the tool generally available to a larger group of merchants. However, this would mean a significant increase in the data our Flink pipeline would be ingesting. Remember, our pipeline was already ingesting one billion rows of data for a limited group of merchants. Ingesting an ever-growing dataset wouldn’t be sustainable.
As a solution, we looked into a SQL-based external data warehouse. We needed something that our Flink pipeline could submit queries to and that could write results back to Google Cloud Storage (GCS). By doing this, we could dramatically simplify the current Flink pipeline by removing ingestion, giving us higher throughput for our general availability launch.
The external data warehouse needed to meet the following three criteria:
- Atomically load the parquet dataset easily
- Handle 60 requests per minute (our general availability estimation) without significant queuing or waiting time
- Export the parquet dataset to GCS easily
The first query engine that came to mind was BigQuery. It’s a data warehouse that can both store petabytes of data and query those datasets within seconds. BigQuery is fully managed by Google Cloud Platform and was already in use at Shopify. We knew we could load our one billion row dataset into BigQuery and export query results into GCS easily. With all of this in mind, we started the exploration but we met an unexpected obstacle: cost.
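To make the load and export criteria concrete, here's a minimal sketch of what that round trip looks like with BigQuery's Python client. Our production pipeline submits its work from Flink, not Python, and every project, bucket, and table name below is made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical table and bucket names, for illustration only.
table_id = "my-project.marketing.merchant_features"

# Atomically load a Parquet dataset from GCS into a BigQuery table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/features/*.parquet",
    table_id,
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
        write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    ),
)
load_job.result()  # wait for the load to finish

# Export a table (e.g. a query result written to a table) back to GCS as Parquet.
extract_job = client.extract_table(
    table_id,
    "gs://my-bucket/exports/results-*.parquet",
    job_config=bigquery.ExtractJobConfig(
        destination_format=bigquery.DestinationFormat.PARQUET
    ),
)
extract_job.result()
```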
A Single Query Would Cost Nearly $1 Million
As mentioned above, BigQuery was already in use at Shopify, so our internal data modeling tool had an existing BigQuery loader, and loading our large dataset into BigQuery was easy. However, when we first ran the query, the log showed the following:
total bytes processed: 75462743846, total bytes billed: 75462868992
That roughly translated to 75 GB billed for a single query. This immediately raised an alarm because BigQuery's on-demand pricing charges by the amount of data processed per query. If each query were to scan 75 GB of data, how much would it cost us at our general availability launch?
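One cheap way to catch this kind of surprise before it hits the bill is a dry run, which reports the bytes a query would process without actually executing it. Here's a minimal sketch with the Python client, using a made-up table and query:

```python
from google.cloud import bigquery

client = bigquery.Client()

# dry_run=True estimates the query without executing it (and without cost);
# disabling the cache gives the worst-case bytes-processed number.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

sql = """
SELECT feature_a, feature_b
FROM `my-project.marketing.merchant_features`   -- hypothetical table
WHERE shop_id = 123
"""
job = client.query(sql, job_config=job_config)

print(f"This query would process about {job.total_bytes_processed / 1e9:.1f} GB")
```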
I quickly did some rough math. If we estimate 60 RPM at launch, then:
60 RPM x 60 minutes/hour x 24 hours/day x 30 days/month = 2,592,000 queries/month
If each query scans 75 GB of data, then we’re looking at approximately 194,400,000 GB of data scanned per month. According to BigQuery’s on-demand pricing scheme, it would cost us $949,218.75 USD per month!
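For reference, here's the same back-of-the-envelope math in a few lines of Python, assuming the $5 per TiB on-demand rate that applied at the time:

```python
# Rough monthly cost estimate under BigQuery on-demand pricing.
QUERIES_PER_MONTH = 60 * 60 * 24 * 30   # 60 RPM, around the clock, 30 days = 2,592,000
GB_PER_QUERY = 75                       # ~75 GB billed per query
PRICE_PER_TIB_USD = 5                   # on-demand rate at the time (assumption)

tib_per_month = QUERIES_PER_MONTH * GB_PER_QUERY / 1024
print(f"${tib_per_month * PRICE_PER_TIB_USD:,.2f} per month")   # -> $949,218.75 per month
```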
Clustering to the Rescue
With the estimation above, we immediately started to look for solutions to reduce this monstrous cost.
We knew that clustering our tables could help reduce the amount of data scanned in BigQuery. As a reminder, clustering is the act of sorting your data based on one or more columns in your table. You can cluster a table on fields like DATE, GEOGRAPHY, TIMESTAMP, etc. Because related rows end up stored next to each other, BigQuery can then limit a query's scan to only the blocks it actually needs.
With clustering in mind, we went digging and discovered several condition clauses in the query that we could cluster on. These were ideal because, if we clustered our table on the columns appearing in WHERE clauses, the filters in our query would ensure only the matching data is scanned. The query engine stops scanning once it has found the rows satisfying those conditions, so only the relevant data is read instead of the entire table. This reduces the number of bytes scanned and saves a lot of processing time.
We created a clustered dataset on two feature columns from the query's WHERE clause. We then ran the exact same query, and the log now showed 508.1 MB billed. That's roughly 150 times less data scanned than with the previous unclustered table.
With our newly clustered table, the query now only scanned 108.3 MB of data. Doing some rough math again:
2,592,000 queries/month x 0.1 GB of data = 259,200 GB of data scanned/month
That would bring our cost down to approximately $1,370.67 USD per month, which is way more reasonable.
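The same estimate, updated for the clustered table:

```python
# Monthly cost with the clustered table, at the same assumed $5 per TiB rate.
QUERIES_PER_MONTH = 2_592_000
GB_PER_QUERY = 0.1083                   # ~108.3 MB scanned per query
print(f"${QUERIES_PER_MONTH * GB_PER_QUERY / 1024 * 5:,.2f} per month")   # -> ~$1,370.67 per month
```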
Other Tips for Reducing Cost
While all it took was some clustering for us to significantly reduce our costs, here are a few other tips for lowering BigQuery costs:
- Avoid SELECT * statements: Only select the columns in the table you actually need. This limits the scan to those columns, which limits your cost.
- Partition your tables: This is another way to restrict the data scanned, by dividing your table into segments (aka partitions). You can create partitions in BigQuery based on time-units, ingestion time, or integer range. A short sketch after this list shows both of these tips together.
- Don’t run queries to explore or preview data: Doing this would be an unnecessary cost. You can use table preview options to view data for free.
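Here's a hedged sketch combining the first two tips: a day-partitioned (and clustered) table, queried with an explicit column list and a partition filter. Every name in it is hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# A table partitioned by day and clustered by shop_id; queries that filter on
# event_date and shop_id only touch the matching partitions and blocks.
client.query("""
CREATE TABLE `my-project.marketing.events`
(
  event_date DATE,
  shop_id    INT64,
  feature_a  STRING,
  feature_b  STRING
)
PARTITION BY event_date
CLUSTER BY shop_id
""").result()

# Select only the columns you need and filter on the partitioning column,
# instead of running SELECT * over the whole table.
job = client.query("""
SELECT shop_id, feature_a
FROM `my-project.marketing.events`
WHERE event_date = DATE '2021-06-01' AND shop_id = 123
""")
for row in job.result():
    print(row.shop_id, row.feature_a)
```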
And there you have it. If you’re working with a high volume of data and using BigQuery, following these tips can help you save big. Beyond cost savings, this is critical for helping you scale your data architecture.
Calvin is a senior developer at Shopify. He enjoys tackling hard and challenging problems, especially in the data world. He’s now working with the Return on Ads Spend group in Shopify. In his spare time, he loves running, hiking and wandering in nature. He is also an amateur Go player.
Are you passionate about solving data problems and eager to learn more about Shopify? Check out openings on our careers page.