Pummelling the Platform–Performance Testing Shopify

Developing a product or service at Shopify requires care and consideration. When we deploy new code at Shopify, it’s immediately available for merchants and their customers. When over 1 million merchants rely on Shopify for a successful Black Friday Cyber Monday (BFCM), it’s extremely important that all merchants—big and small—can run sales events without any surprises.

We typically classify a “sales event” as any single event that attracts a high amount of buyers to Shopify merchants. This could be a product launch, a promotion, or an event like BFCM. In fact, BFCM 2020 was the largest single sales event that Shopify has ever seen, and many of the largest merchants on the planet also saw some of the biggest flash sales ever observed before on Earth.

In order to ensure that all sales are successful, we regularly and repeatedly simulate large sales internally before they happen for real. We proactively identify and eliminate issues and bottlenecks in our systems using simulated customer traffic on representative test shops. We do this dozens of times per day using a few tools and internal processes.

I’ll give you some insight into the tools we use to raise confidence in our ability to serve large sales events. I’ll also cover our experimentation and regression framework we built to ensure that we’re getting better, week-over-week, at handling load.

We use “performance testing” as an umbrella term that covers different types of high-traffic testing including (but not limited to) two types of testing that happen regularly at Shopify: “load testing” and “stress testing”. 

Load testing verifies that a service under load can withstand a known level of traffic or specific number of requests. An example load test is when a team wants to confirm that their service can handle 1 million requests per minute for a sustained duration of 15 minutes. The load test will confirm (or disconfirm) this hypothesis.

Stress testing, on the other hand, is when we want to understand the upper limit of a particular service. We do this by increasing the amount of load—sometimes very quicklyto the service being tested until it crumbles under pressure. This gives us a good indication of how far individual services at Shopify can be pushed, in general.

We condition our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like BFCM. Through proactive load tests and stress tests, we have a really good picture of what to expect even before a flash sale kicks off. 

Enabling Performance Testing at Scale

A platform as big and complex as Shopify has many moving parts, and each component needs to be finely tuned and prepared for large sales events. Not unlike a sports car, each individual part needs to be tested under load repeatedly to understand performance and throughput capabilities before assembling all the parts together and taking the entire system for a test drive. 

Individual teams creating services at Shopify are responsible for their own performance testing on the services they build. These teams are best positioned to understand the inner workings of the services they own and potential bottlenecks or situations that may be overwhelmed under extreme load, like during a flash sale. To enable these teams, performance testing needs to be approachable, easy to use and understand, and well-supported across Shopify. The team I lead is called Platform Conditioning, and our mission is to constantly improve the tooling, process, and culture around performance testing at Shopify. Our team makes all aspects of the Shopify platform stronger by simulating large sales events and making high-load events a common and regular occurrence for all developers. Think of Platform Conditioning as the personal trainers of Shopify. It’s Platform Conditioning that can help teams develop individualized workout programs and set goals. We also provide teams with the tools they need in order to become stronger.

Generating Realistic Load

At the heart of all our performance testing, we create “load”. A service at Shopify will add load to to cause stresses that—in the end—make it stronger, by using requests that hit specific endpoints of the app or service.

Not all requests are equal though, and so stress testing and load testing are never as easy as tracking the sheer volume of requests received by a service. It’s wise to hit a variety of realistic endpoints when testing. Some requests may hit CDNs or layers of caching that make the response very lightweight to generate. Other requests, however, can be extremely costly and include multiple database writes, N+1 queries, or other buried treasures. It’s these goodies that we want to find and mitigate up front, before a sales event like BFCM 2020.

For example, a request to a static CSS file is served from a CDN node in 40ms without creating any load to our internal network. Comparatively, making a search query on a shop hits three different layers of caching and queries Redis, MySQL, and Elasticsearch with total round-trip time taking 1.5 seconds or longer.

Another important factor to generating load is considering the shape of the traffic as it appears at our load balancers. A typical flash sale is extremely spiky and can begin with a rush of customers all trying to purchase a limited product simultaneously. It’s very important to simulate this same traffic shape when generating load and to run our tests for the same duration that we would see in the wild.

A flow diagram showing how we generate load with go-lua
A systems diagram showing how we generate load with go-lua

When generating load we use a homegrown, internal tool that generates raw requests to other services and receives responses from them. There are two main pieces to this tool: the first is the coordinator, and the second is the group of workers that generate the load. Our load generator is written in Go and executes small scripts written in Lua called “flows”. Each worker is running a Go binary and uses a very fast and lightweight Lua VM for executing the flows. (The Go-Lua VM that we use is open source and can be found on Github) Through this, the steps of a flow can scale to issue tens of millions of requests per minute or more. This technique stresses (or overwhelms) specific endpoints of Shopify and allows us to conduct formal tests using the generated load.

We use our internal ChatOps tool, ‘Spy’, to enqueue tests directly from Slack, so everyone can quickly see when a load test has kicked off and is running. Spy will take care of issuing a request to the load generator and starting a new test. When a test is complete, some handy links to dashboards, logs, and overall results of the test are posted back in Slack.

Here’s a snippet of a flow, written in Lua, that browses a Shopify storefront and logs into a customer account—simulating a real buyer visiting a Shopify store:

Just like a web browser, when a flow is executing it sends and receives headers, makes requests, receives responses and simulates many browser actions like storing cookies for later requests. However, an executing flow won’t automatically make subsequent requests for assets on a page and can’t execute Javascript returned by the server. So our load tests don’t make any XMLHttpRequest (XHR) requests, Javascript redirects or reloads that can happen in a full web browser.

So our basic load generator is extremely powerful for generating a great deal of load, but in its purest form it only can hit very specific endpoints as defined by the author of a flow. What we create as “browsing sessions” are only a streamlined series of instructions and only include a few specific requests for each page. We want all our performance testing as realistic as possible, simulating real user behaviour in simulated sales events and generating all the same requests that actual browsers make. To accomplish this, we needed to bridge the gap between scripted load generation and realistic functionality provided by real web browsers.

Simulating Reality with HAR-based Load Testing

Our first attempt at simulating real customers and adding realism to our load tests was an excellent idea, but fairly naive when it came to how much computing power it would require. We spent a few weeks exploring browser-based load testing. We researched tools that were already available and created our own using headless browsers and tools like Puppeteer. My team succeeded in making realistic browsing sessions, but unfortunately the overhead of using real browsers dramatically increased both computing costs and real money costs. With browser-based solutions, we could only drive thousands of browsing sessions at a time, and Shopify needs something that can scale to tens of millions of sessions. Browsers provide a lot of functionality, but they come with a lot of overhead. 

After realizing that browser-based load generation didn’t suit our needs, my team pivoted. We were still driving to add more realism to our load tests, and we wanted to make all the same requests that a browser would. If you open up your browser’s Developer Tools, and look at the Network tab while you browse, you see hundreds of requests made on nearly every page you visit. This was the inspiration for how we came up with a way to test using HTTP Archive (HAR) files as a solution to our problems.

An image of chrome developer tools showing a small sample of requests made by a single product page
A small sample of requests made by a single product page

HAR files are detailed JSON representations of all of the network requests and responses made by most popular browsers. You can export HAR files easily from your browser, or web proxy tools like Charles Proxy. A single HAR file includes all of the requests made during a browsing session and are easy to save, examine, and share. We leveraged this concept and created a HAR-based load testing solution. We even gave it a tongue-and-cheek name: Hardy Har Har.

Hardy Har Har (or simply HHH for those who enjoy brevity) bridges the gap between simple, lightweight scripted load tests and full-fledged, browser-based load testing. HHH will take a HAR file as input and extract all of the requests independently, giving the test author the ability to pick and choose which hostnames can be targeted by their load test. For example, we nearly always remove requests to external hostnames like Google Analytics and requests to static assets on CDN endpoints (They only add complexity to our flows and don’t require load testing). The resulting output of HHH is a load testing flow, written in Lua and committed into our load testing repository in Git. Now—literally at the click of a button—we can replay any browsing session in its full completeness. We can watch all the same requests made by our browser, scaled up to millions of sessions.

Of course, there are some aspects of a browsing session that can’t be simply replayed as-is. Things like logging into customer accounts and creating unique checkouts on a Shopify store need dynamic handling that HHH recognizes and intelligently swaps out the static requests and inserts dynamic logic to perform the equivalent functionality. Everything else lives in Lua and can be ripped apart or edited manually giving the author complete control of the behaviour of their load test. 

Taking a Scientific Approach to Performance Testing

The final step to having great performance testing leading up to a sales event is clarity in observations and repeatability of experiments. At Shopify, we ship code frequently, and anyone can deploy changes to production at any point in time. Similarly, anyone can kick off a massive load test from Slack whenever they please. Given the tools we’ve created and the simplicity in using them, it’s in our best interest to ensure that performance testing follows the scientific method for experimentation.

A flow diagram showing the performance testing scientific method of experimentation
Applying the scientific method of experimentation to performance testing

Developers are encouraged to develop a clear hypothesis relating to their product or service, perform a variety of experiments, observe the results of various experiment runs, and formulate a conclusion that relates back to their hypothesis. 

All this formality in process can be a bit of a drag when you’re developing, so the Platform Conditioning team created a framework and tool for load test experimentation called Cronograma. Cronograma is an internal Rails app making it easy for anyone to set up an experiment and track repeated runs of a performance testing experiment. 

Cronograma enforces the formal use of experiments to track both stress tests and load tests. The Experiment model has several attributes, including a hypothesis and one or more orchestrations that are coordinated load tests executed simultaneously in different magnitudes and durations. Also, each experiment has references to the Shopify stores targeted during a test and links to relevant dashboards, tracing, and logs used to make observations.

Once an experiment is defined, it can be run repeatedly. The person running an experiment (the runner) starts the experiment from Slack with a command that creates a new experiment run. Cronograma kicks off the experiment and assigns a dedicated Slack channel for the tests allowing multiple people to participate. During the running of an experiment any number of things could happen including exceptions, elevated traffic levels, and in some cases, actual people may be paged. We want to record all of these things. It’s nice to have all of the details visible in Slack, especially when working with teams that are Digital by Default. Observations can be made by anyone and comments are captured from Slack and added to a timeline for the run. Once the experiment completes, the experiment runner terminates the run and logs a conclusion based on their observations that relates back to the original hypothesis.

We also included additional fanciness in Cronograma. The tool automatically detects whether any important monitors or alerts were triggered during the experiment from internal or third-party data monitoring applications. Whenever an alert is triggered, it is logged in the timeline for the experiment. We also retrieve metrics from our data warehouse automatically and consume these data in Cronograma allowing developers to track observed metrics between runs of the same experiment. For example:

  • the response times of the requests made
  • how many 5xx errors were observed
  • how many requests per minute (RPM) were generated

All of this information is automatically captured so that running an experiment is useful and it can be compared to any other run of the experiment. It’s imperative to understand whether a service is getting better or worse over time.

Cronograma is the home of formal performance testing experiments at Shopify. This application provides a place for all developers to conduct experiments and repeat past experiments. Hypotheses, observations, and conclusions are available for everyone to browse and compare to. All of the tools mentioned here have led to numerous performance improvements and optimizations across the platform, and they give us confidence that we can handle the next major sales event that comes our way.

The Best Things Go Unnoticed 

Our merchants count on Shopify being fast and stable for all their traffic—whether they’re making their first sale, or they’re processing tens of thousands of orders per hour. We prepare for the worst case scenarios by proactively testing the performance of our core product, services, and apps. We expose problems and fix them before they become a reality for our merchants using simulations. By building a culture of load testing across all teams at Shopify, we’re prepared to handle sales events like BFCM and flash sales. My team’s tools make performance testing approachable for every developer at Shopify, and by doing so, we create a stronger platform for all our merchants. It’s easy to go unnoticed when large sales events go smoothly. We quietly rejoice in our efforts and the realization that it’s through strength and conditioning that we make these things possible.


Performance Scale at Shopify

Join Chris as he chats with Anita about how Shopify conditions our platform through performance testing on a massive scale to ensure that all components of Shopify’s platform can withstand the rush of customers trying to purchase during sales events like flash sales and Black Friday Cyber Monday.  

Chris Inch is a technology leader and development manager living in Kitchener, Ontario, Canada. By day, he manages Engineering teams at Shopify, and by night, he can be found diving head first into a variety of hobbies, ranging from beekeeping to music to amateur mycology.

We're planning to DOUBLE our engineering team in 2021 by hiring 2,021 new technical roles (see what we did there?). Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Help us scale & make commerce better for everyone