By Kathryn Tang and Kir Shatrov
The fourth Thursday in November is Thanksgiving in the United States. The day after, Black Friday (coined in 1961), is the first day of the Christmas shopping season, and since 2005 it’s been the busiest shopping day of the year in North America. Cyber Monday is a more recent development. Named in 2005, it refers to the Monday after the Thanksgiving weekend, when retailers focus on online sales. At Shopify, we call the weekend including Black Friday and Cyber Monday BFCM.
From the engineering team’s point of view, every BFCM challenges the platform and all the things we’ve shipped throughout the year:
- Would our clusters handle two times the number of virtual machines?
- Would we hit some sort of limitation on the new network design?
- Would the new logging pipeline handle such an increase in traffic?
- What’s going to be the next scalability bottleneck that we hit?
The other challenge is capacity planning. We need to understand the magnitude of traffic ahead of us, and how many resources like CPUs and storage we’ll need to handle BFCM sales. On top of that, we need enough headroom in case something unexpected happens and we need to perform a regional failover.
Since 2017, we’ve partnered with Google Cloud Platform (GCP) as our main vendor for the cloud. Over these years, we’ve worked closely with their team on our capacity models, and prior to every BFCM that collaboration gets even closer.
In this post, we’ll cover our approaches to capacity planning, and how we rolled it out across the org and to dozens of teams. We’ll also share how we validated our capacity plans with scalability tests to make sure they work.
Our Google Cloud resource needs depend on how much traffic our merchants see during BFCM. We worked with our data scientists to forecast traffic levels and set those levels as a bar for our platform to scale to. Additionally, we looked into historical numbers, applied a safety margin, and projected how many buyers would check out or view online stores.
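The arithmetic behind this kind of target can be sketched in a few lines. This is a minimal illustration, not our actual model: the function names, the growth factor, the safety margin, and the per-core throughput below are all hypothetical values chosen for the example.

```python
import math


def target_peak(historical_peak_rpm: float, growth_factor: float, safety_margin: float) -> float:
    """Scale a historical peak by forecast growth, then add a safety margin on top."""
    return historical_peak_rpm * growth_factor * (1.0 + safety_margin)


def cores_needed(target_rpm: float, rpm_per_core: float) -> int:
    """Round up: a fraction of a core still means provisioning a whole one."""
    return math.ceil(target_rpm / rpm_per_core)


# Hypothetical inputs: last year's peak requests per minute, forecast growth,
# a 25% safety margin, and an assumed throughput per CPU core.
target = target_peak(1_000_000, growth_factor=1.5, safety_margin=0.25)
cores = cores_needed(target, rpm_per_core=600)
print(f"target: {target:,.0f} rpm, cores: {cores}")
```

Keeping the safety margin as a separate parameter, rather than folding it into the growth factor, makes it easy to discuss the forecast and the cushion independently.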
We created a master resourcing plan for our Google Cloud implementation and estimated how things like CPUs and storage would scale to BFCM traffic levels. Owners of our top 10 or so resource areas were tasked with estimating what they needed for BFCM. These estimates were detailed breakdowns of the machine types, geographic locations, and quantities of resources like CPUs. We also added buffers to our overall estimates to give us the flexibility to change our resourcing needs, move machines across projects, or fail over traffic to different regions if we needed to. It also helps that we partition each component into a separate GCP project, which makes it a lot easier to reason about quotas per project.
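As a rough sketch of how per-owner estimates might roll up into per-project, per-region totals with a buffer applied, consider the following. The project names, CPU counts, and the 25% buffer are made up for illustration, and `Estimate` and `quota_plan` are hypothetical helpers, not part of any GCP API.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Estimate:
    project: str       # hypothetical GCP project name, one per component
    region: str
    machine_type: str
    cpus: int


def quota_plan(estimates, buffer_fraction):
    """Sum CPU estimates per (project, region), then apply the buffer, rounding up."""
    totals = {}
    for e in estimates:
        key = (e.project, e.region)
        totals[key] = totals.get(key, 0) + e.cpus
    return {key: math.ceil(cpus * (1 + buffer_fraction)) for key, cpus in totals.items()}


# Illustrative estimates only; none of these figures come from a real plan.
ESTIMATES = [
    Estimate("storefront-prod", "us-east1", "n1-standard-16", 4000),
    Estimate("storefront-prod", "us-east1", "n1-highmem-8", 2000),
    Estimate("storefront-prod", "us-central1", "n1-standard-16", 3000),
    Estimate("logging-prod", "us-east1", "n1-standard-8", 1000),
]

plan = quota_plan(ESTIMATES, buffer_fraction=0.25)
for (project, region), cpus in sorted(plan.items()):
    print(f"{project:16} {region:12} {cpus:>6} CPUs")
```

Grouping by project and region mirrors the one-project-per-component layout described above, so each total maps onto a single quota to review.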
2020 is an exceptionally difficult year to plan for. Normally, we’d look at BFCM trends from years prior and predict BFCM traffic with a fairly high level of confidence. This year, COVID-19 lockdowns drove a rapid shift to selling online this spring, and we didn't know what to expect. Would we see a massive increase in online traffic this BFCM, or a global economic depression where consumers stopped buying much at all? To manage heightened uncertainty, we forecasted multiple scenarios and their respective needs for our cloud deployment.
From an investment perspective, planning for the largest scale scenario means spending a lot of money very quickly to handle sales that might not happen. Alternatively, not deploying enough machines means having too little computing power and putting our merchant storefronts at risk of outages. It was absolutely vital to avoid anything that would put our merchants at risk of downtime. We decided to scale to our more aggressive growth scenarios to ensure our platform is stable regardless of what happens. We’re transparent with our partners, finance teams, and internal teams about how we thought through these scenarios which helps them make their own operating decisions.
A sheet with a capacity plan is just a starting point. Once we start scaling to projected numbers, there’s a high chance that we’ll hit limits throughout our tech stack that need resolving. In a complex system, there’s always a limit like:
- the number of VMs in a network
- the number of packets that a busy Memcached server can accept
- the number of MB/s your logging pipeline can handle.
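One way to reason about limits like these ahead of time is to project each known metric at the planned scale and flag anything that gets close to its ceiling. The sketch below assumes you already know the current value and the ceiling for each limit, which in practice is the hard part. All names and numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    name: str
    current: float   # value observed today
    ceiling: float   # known or suspected hard limit


def at_risk(limits, scale_factor, headroom=0.75):
    """Return names of limits whose projected value exceeds headroom * ceiling."""
    return [
        limit.name
        for limit in limits
        if limit.current * scale_factor > headroom * limit.ceiling
    ]


# Made-up figures for the three kinds of limits listed above.
LIMITS = [
    Limit("VMs in a network", current=5_000, ceiling=15_000),
    Limit("Memcached packets/s", current=700_000, ceiling=1_000_000),
    Limit("logging pipeline MB/s", current=300, ceiling=1_000),
]

print(at_risk(LIMITS, scale_factor=2))
```

A check like this only covers limits you already know about. Scale-up tests exist to surface the ones nobody has written down.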
Historically, every BFCM brought us some scalability surprises, and what’s worse, we’d only notice them when fully scaled prior to BFCM. That left too little time to come up with mitigation plans.
Back in 2018, we decided that a “faux” BFCM in the middle of the year would increase our resilience as an organization and push us to find unknowns that we’d otherwise only discover during the real thing. Once we started doing that, we found problems at scale more often and built the mental muscle of preparing for critical events and finding unknowns. If you’re exercising and something feels hard, you train more and eventually your muscles get stronger. Shopify treats BFCM the same way.
We’ve started the practice of regular scale-up testing at Shopify, and of course we made sure to come up with fun names for each. We’ve had Mayday (2019), Spooky scale-up (2019), and Oktoberfest scale-up (2020). Another fun fact is that our Waterloo teams play a large part in running this testing, and the dates of our Oktoberfest matched the city of Kitchener-Waterloo’s Oktoberfest festivities (it’s the second-largest Oktoberfest in the world).
Oktoberfest scale-up’s goal was to simulate this year’s expected BFCM load based on the traffic forecasts from the data science team. Because we run Shopify in the cloud on Google Kubernetes Engine, we could grab extra compute capacity just for the window of the exercise and pay only for the hours we needed it.
Investment in our internal load testing tooling over the years is fundamental to our ability to run such large-scale, platform-wide load tests. We’ve talked about go-lua, an open source project that powers our load testing tool. Thanks to embedded Lua, we feed it a high-level set of steps for what we want to test: actions like browsing the storefront, adding a product to a cart, proceeding to checkout, and processing the transaction through a mock payment gateway.
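To give a feel for what a high-level set of steps looks like, here is a toy model of a buyer flow. It is written in Python purely for illustration; it is not the Lua scripting interface of our load testing tool, it makes no network requests, and every name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """State for one simulated buyer (hypothetical structure)."""
    cart: list = field(default_factory=list)
    log: list = field(default_factory=list)
    paid: bool = False


def browse_storefront(session):
    session.log.append("browse_storefront")


def add_to_cart(session):
    session.cart.append("example-product")
    session.log.append("add_to_cart")


def checkout(session):
    if not session.cart:
        raise ValueError("cannot check out with an empty cart")
    session.log.append("checkout")


def pay_with_mock_gateway(session):
    # A mock gateway always approves, so no real payment provider is involved.
    session.paid = True
    session.log.append("pay_with_mock_gateway")


# A flow is an ordered list of steps; a real tool would run many in parallel.
CHECKOUT_FLOW = [browse_storefront, add_to_cart, checkout, pay_with_mock_gateway]


def run_flow(flow):
    """Run each step in order against a fresh session and return the session."""
    session = Session()
    for step in flow:
        step(session)
    return session


result = run_flow(CHECKOUT_FLOW)
print(result.log, result.paid)
```

Describing a flow as data (an ordered list of named steps) keeps the test scenario readable and lets the same steps be recombined, for example into a browse-only flow.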
Thanks to Oktoberfest scale-up, we identified and then fixed some bottlenecks that could have become an issue for the real BFCM. Doing the test in early October gave us time to address issues.
After addressing all the issues, we repeated the scale-up test to see how our mitigations helped. Seeing that going smoothly increased our confidence levels about the upcoming Black Friday and reduced stress levels for all teams.
We strive for a smooth BFCM and spend a lot of time preparing for it, from capacity planning and setting expectations for our vendors to load testing and failover simulations. Beyond delivering a smooth holiday season for our merchants, BFCM is a time to reflect on the future. As Shopify continues to grow, BFCM traffic levels can become the normal everyday loads we see in the next year. Our job is to bring lessons from events like BFCM to make our systems even more automated, more dynamic, and more resilient. We relish this opportunity to think about where Shopify is going and to architect our platform to scale with it.
Kir Shatrov is an Engineering Lead who’s been with Production Engineering at Shopify for the past five years, working on areas from CI/CD infrastructure to sharding and capacity planning.
Kathryn Tang is an Engineering Program Manager who manages our Google Cloud relationship. She has been at Shopify for four years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.