By Arbab Ahmed and Bruno Deszczynski
Black Friday and Cyber Monday—or as we like to call it, BFCM—is one of the largest sales events of the year. It’s also one of the most important moments for Shopify and our merchants. To put it into perspective, this year our merchants across more than 175 countries sold a record breaking $5.1+ billion over the sales weekend.
That’s a lot of sales. That’s a lot of data, too.
This BFCM, the Shopify data platform saw an average throughput increase of 150 percent. Our mission as the Shopify Data Platform Engineering (DPE) team is to ensure that our merchants, partners, and internal teams have access to data quickly and reliably. It shouldn’t matter if a merchant made one sale per hour or a million; they need access to the most relevant and important information about their business, without interruption. While this is a must all year round, the stakes are raised during BFCM.
Creating a data platform that withstands the largest sales event of the year means our platform services need to be ready to handle the increase in load. In this post, we’ll outline the approach we took to reliably scale our data platform in preparation for this high-volume event.
Data Platform Overview
Shopify’s data platform is an interdisciplinary mix of processes and systems that collect and transform data for use by our internal teams and merchants. It enables access to data through a familiar pipeline:
- Ingesting data in any format, from any part of Shopify. “Raw” data (for example, pageviews, checkouts, and orders) is extracted from Shopify’s operational tables without any manipulation. Data is then conformed to an Apache Parquet format on disk.
- Processing data, in either batches or streams, to form the foundations of business insights. Batches of data are “enriched” with models developed by data scientists, and processed within Apache Spark or dbt.
- Delivering data to our merchants, partners, and internal teams so they can use it to make great decisions quickly. We rely on an internal collection of streaming and serving applications, and libraries that power the merchant-facing analytics in Shopify. They’re backed by BigTable, GCS, and CloudSQL.
In an average month, the Shopify data platform processes about 880 billion MySQL records and 1.75 trillion Kafka messages.
As engineers, we want to conquer every challenge right now. But that’s not always realistic or strategic, especially when not all data services require the same level of investment. At Shopify, a tiered services taxonomy helps us prioritize our reliability and infrastructure budgets in a broadly declarative way. It’s based on the potential impact to our merchants and looks like this:
This service is critical externally, for example. to a merchant’s ability to run their business
This service is critical internally to business functions, e.g. a operational monitoring/alerting service
This service is valuable internally, for example, internal documentation services
This service is an experiment, in very early development, or is otherwise disposable. For example, an emoji generator
The highest tiers are top priority. Our ingestion services, called Longboat and Speedboat, and our merchant-facing query service Reportify are examples of services in Tier 1.
As we’ve mentioned, each BFCM the Shopify data platform receives an unprecedented volume of data and queries. Our data platform engineers did some forecasting work this year and predicted nearly two times the traffic of 2019. The challenge for DPE is ensuring our data platform is prepared to handle that volume.
When it comes to BFCM, the primary risk to a system’s reliability is directly proportional to its throughput requirements. We call it throughput risk. It increases the closer you get to the front of the data pipeline, so the systems most impacted are our ingestion and processing systems.
With such a titillating forecast, the risk we faced was unprecedented throughput pressure on data services. In order to be BFCM ready, we had to prepare our platform for the tsunami of data coming our way.
The Game Plan
We tasked our Reliability Engineering team with Tier 1 and Tier 2 service preparations for our ingestion and processing systems. Here’s the steps we took to prepare our systems most impacted by BFCM volume:
1. Identify Primary Objectives of Services
A data ingestion service's main operational priority can be different from that of a batch processing or streaming service. We determine upfront what the service is optimizing for. For example, if we’re extracting messages from a limited-retention Kafka topic, we know that the ingestion system needs to ensure, above all else, that no messages are lost in the ether because they weren’t consumed fast enough. A batch processing service doesn’t have to worry about that, but it may need to prioritize the delivery of one dataset versus another.
In Longboat’s case, as a batch data ingestion service, its primary objective is to ensure that a raw dataset is available within the interval defined by its data freshness service level objective (SLO). That means Longboat is operating reliably so long as every dataset being extracted is no older than eight hours— the default freshness SLO. For Reportify, our main query serving service, its primary objective is to get query results out as fast as possible; its reliability is measured against a latency SLO.
2. Pinpoint Service Knobs and Levers
With primary objectives confirmed, you need to identify what you can “turn up or down” to sustain those objectives.
In Longboat’s case, extraction jobs are orchestrated with a batch scheduler, and so the first obvious lever is job frequency. If you discover a raw production dataset is stale, it could mean that the extraction job simply needs to run more often. This is a service-specific lever.
Another service-specific lever is Longboat’s “overlap interval” configuration, which configures an extraction job to redundantly ingest some overlapping span of records in an effort to catch late-arriving data. It’s specified in a number of hours.
Memory and CPU are universal compute levers that we ensure we have control of. Longboat and Reportify run on Google Kubernetes Engine, so it’s possible to demand that jobs request more raw compute to get their expected amount of work done within their scheduled interval (ignoring total compute constraints for the sake of this discussion).
So, in pursuit of data freshness in Longboat, we can manipulate:
- Job frequency
- Longboat overlap interval
- Kubernetes Engine Memory/CPU requests
In pursuit of latency in Reportify, we can turn knobs like its:
- BigTable node pool size
- ProxySQL connection pool/queue size
3. Run Load Tests!
Now that we have some known controls, we can use them to deliberately constrain the service’s resources. As an example, to simulate an unrelenting N-times throughput increase, we can turn the infrastructure knobs so that we have 1/N the amount of compute headroom, so we’re at N-times nominal load.
For Longboat’s simulation, we manipulated its “overlap interval” configuration and tripled it. Every table suddenly looked like it had roughly three times more data to ingest within an unchanged job frequency; throughput was tripled.
For Reportify, we leveraged our load testing tools to simulate some truly haunting throughput scenarios, issuing an increasingly extreme volume of queries, as seen here:
In this graph, the doom is shaded purple.
Load testing answers a few questions immediately, among others:
- Do infrastructure constraints affect service uptime?
- Does the service’s underlying code gracefully handle memory/CPU constraints?
- Are the raised service alarms expected?
- Do you know what to do in the event of every fired alarm?
If any of the answers to these questions leave us unsatisfied, the reliability roadmap writes itself: we need to engineer our way into satisfactory answers to those questions. That leads us to the next step.
4. Confirm Mitigation Strategies Are Up-to-Date
A service’s reliability depends on the speed at which it can recover from interruption. Whether that recovery is performed by a machine or human doesn’t matter when your CTO is staring at a service’s reliability metrics! After deliberately constraining resources, the operations channel turns into a (controlled) hellscape and it's time to act as if it were a real production incident.
Talking about mitigation strategy could be a blog post on its own, but here are the tenets we found most important:
- Every alert must be directly actionable. Just saying “the curtains are on fire!” without mentioning “put it out with the extinguisher!” amounts to noise.
- Assume that mitigation instructions will be read by someone broken out of a deep sleep. Simple instructions are carried out the fastest.
- If there is any ambiguity or unexpected behavior during controlled load tests, you’ve identified new reliability risks. Your service is less reliable than you expected. For Tier 1 services, that means everything else drops and those risks should be addressed immediately.
- Plan another controlled load test and ensure you’re confident in your recovery.
- Always over-communicate, even if acting alone. Other engineers will devote their brain power to your struggle.
5. Turn the Knobs Back
Now that we know what can happen with an overburdened infrastructure, we can make an informed decision whether the service carries real throughput risk. If we absolutely hammered the service and it skipped along smiling without risking its primary objective, we can leave it alone (or even scale down, which will have the CFO smiling too).
If we don’t feel confident in our ability to recover, we’ve unearthed new risks. The service’s development team can use this information to plan resiliency projects, and we can collectively scale our infrastructure to minimize throughput risk in the interim.
In general, to be prepared infrastructure-wise to cover our capacity, we perform capacity planning. You can learn more about Shopify’s BFCM capacity planning efforts on the blog.
Overall, we concluded from our results that:
- Our mitigation strategy for Longboat and Reportify was healthy, needing gentle tweaks to our load-balancing maneuvers.
- We should scale up our clusters to handle the increased load, not only from shoppers, but also from some of our own fun stuff like the BFCM Live Map.
- We needed to tune our systems to make sure our merchants could track their online store’s performance in real-time through the Live View in the analytics section of their admin.
- Some jobs could use some tuning, and some of their internal queries could use optimization.
Most importantly, we refreshed our understanding of data service reliability. Ideally, it’s not any more exciting than that. Boring reliability studies are best.
We hope to perform these exercises more regularly in the future, so BFCM preparation isn’t particularly interesting. In this post we talked about throughput risk as one example, but there are other risks to data integrity, correctness, latency. We aim to get out in front of them too because data grows faster than engineering teams do. “Trillions of records every month” turns into “quadrillions” faster than you expect.
So, How’d It Go?
After months of rigorous preparation systematically improving our indices, schemas, query engines, infrastructure, dashboards, playbooks, SLOs, incident handling and alerts, we can proudly say BFCM 2020 went off without a hitch!
During the big moment we traced down every spike, kept our eyes glued to utilization graphs, and turned knobs from time to time, just to keep the margins fat. There were only a handful of minor incidents that didn’t impact merchants, buyers or internal teams - mainly self healing cases thanks to the nature of our platform and our spare capacity.
This success doesn’t happen by accident, it happens because of diligent planning, experience, curiosity and—most importantly—teamwork.
Interested in helping us scale and tackle interesting problems? We’re planning to double our engineering team in 2021 by hiring 2,021 new technical roles. Learn more here!