Keeping Developers Happy with a Fast CI

Imagine you just implemented an amazing new feature that will change the life of thousands of users. You fixed all the review suggestions from your awesome colleagues and are finally ready to merge and deploy. However, continuous integration (CI) defeats this plan by revealing a minor bug. And because CI is just too slow, releasing this feature now has to wait till tomorrow.

If this sounds familiar, you're not alone. At Shopify, we run automated CI, like executing tests or code linting, on every git push. While we all agree that this should be fast, it's not an easy task if you have more than 170,000 tests to execute. Our developers were frustrated, so we ran a dedicated project to improve the speed of Shopify’s CI. I’ll show you how the Test Infrastructure team was able to reduce the p95 of Shopify’s core monolith CI from 45 minutes to 18.

The Test Infrastructure team is responsible for ensuring Shopify’s CI systems are scalable, robust, and usable. Our team set the objective: run the CI of Shopify's Core monolith in under 10 minutes at the 95th percentile. In other words, 95% of all builds should finish in less than 10 minutes. This was an ambitious goal because our CI's 95th percentile was around 45 minutes.
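
To make the metric concrete, here is a minimal sketch of a nearest-rank percentile calculation. The method name and the build durations are made up for illustration, and monitoring tools may use a different percentile definition (for example, one that interpolates between samples).

```ruby
# Nearest-rank percentile: the smallest sample such that at least
# `percent` % of all samples are less than or equal to it.
def percentile(values, percent)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (percent * sorted.length / 100.0).ceil
  sorted[[rank, 1].max - 1]
end

# Hypothetical build durations in minutes (not real Shopify data).
build_minutes = [8, 9, 9, 10, 11, 12, 12, 13, 14, 15,
                 15, 16, 17, 18, 20, 22, 25, 30, 41, 45]

puts "p95: #{percentile(build_minutes, 95)} minutes" # prints "p95: 41 minutes"
```

With these 20 sample builds, 19 of them (95%) finish in 41 minutes or less, so a single 45-minute outlier doesn't move the p95.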

Architecture Overview

It's important to mention that our CI runs on Buildkite. Buildkite gives us the flexibility to run the CI servers in our own cloud infrastructure. This has several advantages, like aggressive scaling, support for different architectures, and better integration and customization. For instance, it allowed us to implement our own instrumentation framework, which was crucial for the success of this project.

In Buildkite, a pipeline is a template of the steps you want to run. There are many types of steps: some run shell commands, some define conditional logic, and others wait for user input. When you run a pipeline, a build is created. Each step in the pipeline ends up as a job in the build, and the jobs are then distributed to available agents. The agents are installed on our CI servers, where they poll Buildkite for work, run build jobs, and report back the results. Each server runs several of these agents at the same time, and we scale the number of servers based on demand throughout the day.

Setting Priorities With Data Driven Development

There’s a popular carpenter saying: measure twice and cut once. In other words, one should double check one's measurements for accuracy before cutting a piece of wood. Otherwise it may be necessary to cut again, wasting time and material. At the beginning, we took this saying to heart and invested in setting up instrumentation for our CI. We had a good foundation, as we already measured job and build times. However, this only told us that something was slow, not what was slow. On top of that foundation, we needed to instrument every command executed.

Scatter Plot with the Two Dimensions: Execution Time and Number of Executions per Command

With this instrumentation in place, we built a scatter plot with the two dimensions: execution time and number of executions per command. The dots in the top right corner are the commands that take the most time and get executed the most (our top priority). This information was tremendously important for setting priorities, and we discovered three main areas to focus on: preparing agents, building dependencies, and executing the actual tests.
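
As a rough illustration of this kind of per-command instrumentation, the sketch below records how often each command runs and how long it takes, which are the two dimensions of the scatter plot. The class, its methods, and the command names are hypothetical and are not Shopify's actual instrumentation framework.

```ruby
# Collects how often each command ran and how long it took in total.
class CommandStats
  Entry = Struct.new(:count, :total_seconds) do
    def average_seconds
      total_seconds / count
    end
  end

  def initialize
    @entries = Hash.new { |hash, name| hash[name] = Entry.new(0, 0.0) }
  end

  # Runs the block, records its wall-clock duration (even if the block
  # raises), and returns the block's result.
  def measure(name)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    record(name, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started)
  end

  def record(name, seconds)
    entry = @entries[name]
    entry.count += 1
    entry.total_seconds += seconds
  end

  # [name, entry] pairs sorted by total time spent, highest first.
  def by_total_time
    @entries.sort_by { |_name, entry| -entry.total_seconds }
  end
end

stats = CommandStats.new
stats.record("docker-compose up", 90.0) # hypothetical command names and timings
stats.record("docker-compose up", 110.0)
stats.record("git fetch", 12.0)

stats.by_total_time.each do |name, entry|
  puts format("%-20s runs=%d avg=%.1fs", name, entry.count, entry.average_seconds)
end
```

Sorting by total time (count multiplied by average duration) surfaces the same commands that would sit in the top right corner of the plot.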

Improving Docker Start Time by Reducing I/O Bottlenecks

We grouped things like downloading the source code, restoring caches, or starting Docker containers for services like MySQL under preparing agents. These commands consumed around 31% of the time spent in CI, almost a third. This was a huge area to improve.

One bottleneck we discovered quickly was that starting the Docker containers sometimes took up to 2 minutes. Our first assumption was that the CI machines were underprovisioned. It's important to know that we run several Buildkite agents on each testing machine (sometimes running more than 50 containers per machine). Our first experiment was to reduce the number of agents we scheduled on each machine, which indeed reduced the time it took to start a Docker container to a few seconds. However, one of our goals was to not increase the cloud computing expenses by more than 10%. Running more machines would have blown our budget, so this was not an option for now!

After more debugging, we tracked down disk I/O as the bottleneck for starting Docker containers. Right before the containers start, cached directories such as compiled assets or bundled gems are downloaded and written to disk. They often reach more than 10GB per machine. You can set a percentage of system memory that can be filled with "dirty" pages (memory pages that still need to be written to disk) before a background process kicks in to write them out. Whenever this threshold is reached, which is most of the time after downloading caches, I/O is blocked until the dirty pages are synced. The slow start of Docker containers wasn’t the actual problem, but a symptom of another problem.

Line Chart of Docker Start Time Improvements Dropping from 125 Seconds to 25 Seconds on p95

Once we knew the root cause, we implemented several fixes. We increased the disk size, and the write speed increased proportionally with it. Additionally, we mounted most of the caches as read-only. The advantage is that read-only caches can be shared between agents, so we only need to download and write them once per machine. Our p95 for starting containers improved from 90 seconds to 25 seconds, almost 4 times faster! We now write less data a lot faster.

The Fastest Code Is the Code That Doesn’t Run

While the improvements for preparing the test execution benefited all of Shopify, we also implemented improvements specifically for the Rails monolith most of Shopify's engineers work on (my team’s main focus). Like most Rails apps, before we can run any tests we need to prepare dependencies, which means compiling assets, migrating the database, and running bundle install. These tasks consumed about 37% of the time spent in CI. Combined with preparing agents, it meant that 68% of the time in CI was spent just on overhead before we actually ran any test! To improve the speed of building dependencies, we actually didn’t optimize our code. Instead we tried not to run the code at all! Or to quote Robert Galanakis: “The fastest code is the code that doesn’t run.”

How did we do that? We know that only a small number of pull requests change the database or assets (most of the Shopify frontend code is in separate repositories). For database migrations, we calculate an MD5 hash of our structure.sql file and db/migrate folder. If the hash is the same as the one in our cache, we don't need to load the Rails app and run db:migrate. A similar approach was implemented for asset compilation. On top of that, we also run these steps in parallel, which brought this job alone down from 5 minutes to around 3 minutes.
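
The hash check can be sketched roughly as follows. This is a minimal illustration, not Shopify's implementation: the method names, the cache file, and the assumption that structure.sql lives at db/structure.sql (the Rails default) are all mine.

```ruby
require "digest"
require "fileutils"

# Returns one MD5 digest covering db/structure.sql and every file under db/migrate.
def schema_fingerprint(root)
  paths = [File.join(root, "db", "structure.sql")] +
          Dir.glob(File.join(root, "db", "migrate", "**", "*"))

  digest = Digest::MD5.new
  paths.select { |path| File.file?(path) }.sort.each do |path|
    digest << path.delete_prefix(root) # include the name so renames change the hash
    digest << File.binread(path)
  end
  digest.hexdigest
end

# Yields (to run the migration) only when the fingerprint differs from the
# cached one, then stores the new fingerprint in `cache_file`.
def migrate_if_changed(root, cache_file)
  current = schema_fingerprint(root)
  cached = File.exist?(cache_file) ? File.read(cache_file).strip : nil
  return :skipped if cached == current

  yield
  FileUtils.mkdir_p(File.dirname(cache_file))
  File.write(cache_file, current)
  :migrated
end
```

A CI step could call migrate_if_changed with the repository root and a cache path, and run db:migrate inside the block. Writing the cache only after the block returns means a failed migration is retried on the next build instead of being skipped.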

The 80/20 Rule Applies to Tests Too

After preparing agents and building dependencies, the remaining time is spent on running tests. In Shopify Core, we have more than 170,000 tests, and that number grows by 20-30% annually. Running all of them on a single machine takes more than 41 hours. To put this into context, watching all 23 Avengers movies takes around 50 hours. The sheer number of tests and their growth make it unrealistic to optimize the tests themselves.

Around one year ago we released a system to only run tests related to the code change. Rather than running all 170,000 tests on every pull request we’d only run a subset of our test suite. In the beginning, it was merely an approach to fight flaky tests but it also reduced the test execution time significantly. While the original implementation turned out to be very reliable, it mostly focused on Ruby files. However, other non-Ruby files like JSON or YAML files also change frequently and often would trigger a full test run. It was time to go back and improve the original implementation.

For instance, the initial implementation of our test selection ignored changes to ActiveRecord fixtures, which meant a change to a fixture file would always trigger a full test run. As changes to fixture files are quite frequent and have a low risk of breaking production, we decided to create a test mapping for them. Luckily, ActiveRecord already creates an ActiveSupport notification for every SQL query it executes. By subscribing to these events, we were able to create a mapping of which fixtures were used in which tests. We can then look up the tests we need to run by looking at which files have been modified, added, removed, or renamed. This change, along with several other new test mappings, increased the percentage of builds that didn't select all tests from 45% to over 60%. A nice side effect was that the test stability also increased from 88% to 97%.
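
A simplified sketch of such a mapping is shown below. In a Rails app, the SQL strings could come from a subscriber to the "sql.active_record" notification, whose payload includes the query under the :sql key. Everything else here is an assumption for illustration: the class name, the regular expression used to pull table names out of SQL, the one-fixture-file-per-table path convention, and the fallback to a full run for unmapped files.

```ruby
require "set"

# Maps fixture files to the tests that queried the corresponding table.
class FixtureTestMapping
  # Naive table-name extraction; real SQL would need a more careful parser.
  TABLE_PATTERN = /\b(?:FROM|JOIN|INTO|UPDATE)\s+[`"]?(\w+)[`"]?/i

  def initialize
    @tests_by_fixture = Hash.new { |hash, fixture| hash[fixture] = Set.new }
  end

  # Call once per SQL query observed while `test_name` is running.
  def record_query(test_name, sql)
    sql.scan(TABLE_PATTERN).flatten.each do |table|
      @tests_by_fixture[fixture_path(table)] << test_name
    end
  end

  # Returns the tests to run for the changed files, or :all when any
  # changed file has no mapping and a full run is the safe choice.
  def select_tests(changed_files)
    return :all unless changed_files.all? { |file| @tests_by_fixture.key?(file) }

    changed_files.flat_map { |file| @tests_by_fixture[file].to_a }.uniq.sort
  end

  private

  # Assumes the conventional Rails layout of one YAML fixture file per table.
  def fixture_path(table)
    "test/fixtures/#{table}.yml"
  end
end

mapping = FixtureTestMapping.new
mapping.record_query("ShopTest#test_name", "SELECT `shops`.* FROM `shops` WHERE `shops`.`id` = 1")
mapping.record_query("OrderTest#test_total", "SELECT * FROM orders INNER JOIN shops ON shops.id = orders.shop_id")

p mapping.select_tests(["test/fixtures/orders.yml"])
p mapping.select_tests(["config/settings.yml"])
```

Falling back to the full suite whenever a changed file is unknown keeps the selection conservative: a missing mapping costs time, not correctness.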

With all these changes in place, we also noticed that a small percentage of tests are responsible for the slowest CI builds. This corresponds with the Pareto principle, which states that “for many outcomes roughly 80% of consequences come from 20% of the causes.” For instance, we discovered that one test frequently hangs and causes CI to time out. Although we already wrap each test in a Ruby timeout block, it wasn’t 100% reliable. Unfortunately, my team doesn’t have the necessary context and capacity to fix broken tests. Sometimes we even have to disable tests if they cause too much “harm” to Shopify and other developers. This is of course a last resort, and we always notify the original authors to give them the opportunity to investigate and come up with a fix. In this case, by temporarily removing these tests we improved the p95 by 10 minutes, from around 44 minutes to 34.
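
For readers unfamiliar with the pattern, wrapping a test body in a timeout can look roughly like this, using the Timeout module from Ruby's standard library. The helper name and the result symbols are hypothetical, and the actual wrapper in Shopify's test suite is not shown in this article.

```ruby
require "timeout"

# Runs a test body and reports :passed, :failed, or :timed_out.
def run_with_timeout(seconds)
  Timeout.timeout(seconds) { yield }
  :passed
rescue Timeout::Error
  :timed_out
rescue StandardError
  :failed
end

puts run_with_timeout(1) { 1 + 1 }       # finishes well within the limit
puts run_with_timeout(0.05) { sleep 1 }  # exceeds the limit and is interrupted
```

Timeout works by raising an exception inside the running block, so code that swallows that exception or blocks inside native code may not be interrupted. That is one possible reason such a wrapper is not fully reliable, though the article does not say what caused the hang in this particular case.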

Keep Developers Happy

Build Time Distribution Over Time From Start to End of the Project

Slow CI systems are often the cause of frustrated developers. Keeping these systems fast requires an ongoing effort. However, before jumping into performance tweaks, it's important to set up a solid monitoring system. Insight into your CI is crucial for spotting and fixing bottlenecks. After discovering a potential problem, do a root cause analysis to differentiate between symptom and problem. While it might be quicker to fix the symptom, doing so hides the underlying problem and causes more issues in the long run. Although optimizing code might be fun, it's sometimes better to skip or remove the code altogether. Last but not least, improve your test suite: focus on the slowest 20% and you will be surprised how much impact they have on the whole suite. By combining these principles, my team was able to reduce the p95 of Shopify’s core monolith CI from 45 minutes to 18. Our developers now spend less time waiting and ship faster.

Shipit! Presents: Keeping Developers Happy with a Fast CI

Join Christian, Jessica, Kim, and Eduardo as they talk about improving the speed of Shopify’s CI. They reduced the p95 of Shopify’s core monolith CI from 45 minutes to 18.

Christian Bruckmayer is originally from Nuremberg, Germany, but now calls Bristol, UK, his home. He recently joined Shopify and is now a member of the Test Infrastructure team, which is responsible for ensuring Shopify’s CI systems are scalable, robust, and usable. He has been an avid open source contributor since 2014, and Ruby is his tool of choice. If you want to learn more about Christian, get in touch with him on Twitter.
