Developers write tests to ensure correctness and to allow future changes to be made safely. However, as the number of features grows, so does the number of tests. Tests are a double-edged sword. On one hand, well-written ones catch bugs and maintain a program’s stability. On the other hand, as the code base grows, a large test suite impedes scalability because it takes a long time to run and increases the likelihood of intermittently failing tests. Software projects often require all tests to pass before merging to the main branch, which adds overhead for all developers. Intermittently failing tests worsen the problem. Some causes of intermittently failing tests are:

timing
instability in the database
HTTP connections/mockings
random generators
tests that leak state to other tests: the test passes every single time by itself, but causes other tests to fail depending on the execution order.

Unfortunately, one can’t fully eradicate intermittently failing tests, and the likelihood of them occurring increases as the codebase grows. They make already slow test suites even slower, because now you have to retry them until they pass.
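To see why the odds get worse as a suite grows, here is a back-of-the-envelope sketch. It assumes every test fails spuriously with the same small probability, independently of the others. That is a simplification, and the method name and the numbers are made up for illustration.

```ruby
# Probability that at least one of `test_count` tests fails spuriously,
# ASSUMING each test flakes independently with probability `flake_rate`.
def chance_of_flaky_build(test_count, flake_rate)
  1 - (1 - flake_rate)**test_count
end

# Hypothetical flake rate of 0.1% per test: a bigger suite means a build
# is more likely to hit at least one spurious failure.
[100, 1_000, 10_000].each do |count|
  puts "#{count} tests: #{(chance_of_flaky_build(count, 0.001) * 100).round(1)}%"
end
```

Under the same assumption, running fewer tests per build lowers the chance of a spurious failure.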

I’m not implying that one shouldn’t write tests. The benefits of quality assurance, performance monitoring, and speeding up development by catching bugs early instead of in production outweigh the downsides. However, improvements can be made. My team thus embarked on a journey of making our continuous integration (CI) more stable and faster. I’ll share the dynamic analysis system we implemented to select tests, followed by other approaches we explored but decided against. Test selection sparks joy in my life. I hope I can bring the same joy to you.

Problems with Tests at Shopify

Tests impede developers’ productivity here. The test suite of our monolithic repository:

has over 150,000 tests
is growing by 20-30% in size annually
takes about 30-40 minutes to run in parallel on hundreds of Docker containers.

Each pull request requires all tests to pass. Developers have to either wait for tests or pay the price of context switching. In our bi-annual survey, build stability and speed are a recurring topic, so this problem is clearly felt by our developers.

Solving the Problem with Tests

There’s an abundance of blog posts, articles, and research papers on optimizing code, but unfortunately few on optimizing tests. In fact, we learned that it’s unrealistic to optimize the tests themselves because of their sheer quantity and growth. We also learned that this is uncharted territory for many companies.

As our research progressed, it became apparent that the right solution was to only run the tests related to the code change. This was challenging for a large, dynamically typed Ruby codebase that makes ample use of the language flexibility. Furthermore, the difficulty was exacerbated by metaprogramming in Rails as well as other non-Ruby files in the code base that affect how the application behaves, for example, YAML, JSON, and JavaScript.

What Is Dynamic Analysis?

Dynamic analysis, in essence, is logging every single method call. We run each test and track all the files in its call graph. Once we have the call graphs, we create a test mapping: for every file, we find which tests have that file in their call graphs. By looking at which files have been modified, added, removed, or renamed, we can look up the tests we need to run.
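The mapping step can be sketched as inverting a hash. This is a minimal illustration rather than the actual implementation: the file paths, `CALL_GRAPHS`, `build_test_mapping`, and `tests_for` are all hypothetical names.

```ruby
require "set"

# Hypothetical call-graph data: test file => source files it touched while running.
CALL_GRAPHS = {
  "test/models/order_test.rb" => ["app/models/order.rb", "app/models/shop.rb"],
  "test/models/shop_test.rb"  => ["app/models/shop.rb"],
}.freeze

# Invert the call graphs: source file => set of tests whose call graph includes it.
def build_test_mapping(call_graphs)
  mapping = Hash.new { |hash, file| hash[file] = Set.new }
  call_graphs.each do |test, files|
    files.each { |file| mapping[file] << test }
  end
  mapping
end

# Look up the tests to run for a list of changed files.
def tests_for(mapping, changed_files)
  changed_files.flat_map { |file| mapping.fetch(file, Set.new).to_a }.uniq.sort
end

mapping = build_test_mapping(CALL_GRAPHS)
puts tests_for(mapping, ["app/models/shop.rb"])
```

A change to `app/models/shop.rb` selects both tests here, while a change to `app/models/order.rb` selects only the order test.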

You can check out Rotoscope and TracePoint to record call graphs for Ruby applications.
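As a rough illustration of the TracePoint side, the sketch below records the source file of every Ruby-defined method called while a block runs. It's a toy, not our tracer: `files_touched` and `greet` are made-up names, and it only sees `:call` events, so methods implemented in C are not recorded.

```ruby
require "set"

# Runs the given block and returns the set of files that define
# the Ruby methods called while it ran.
def files_touched(&block)
  files = Set.new
  trace = TracePoint.new(:call) { |tp| files << tp.path }
  trace.enable(&block) # tracing is active only for the duration of the block
  files
end

# Stand-in for application code exercised by a test.
def greet(name)
  "Hello, #{name}"
end

touched = files_touched { greet("world") }
puts touched.to_a
```

Running one such trace per test gives the per-test file lists that the mapping is built from.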

Why Use Dynamic Analysis?

Because Ruby is a dynamically typed language, we can’t retrieve a dependency graph using static analysis. Without one, we don’t know which tests correspond to a given piece of code.

Downsides of Running Dynamic Analysis on Ruby on Rails

1. It’s Slow

It’s computationally intensive to generate the call graphs, and we can’t run it for every single PR. Instead, we run dynamic analysis on every deployed commit.

2. Mapping Can Lag Behind HEAD

The generated mapping lags behind the head of main because it runs asynchronously. To solve this problem, we run all tests that satisfy at least one of the following criteria:

added or modified on the current branch
added or modified between the head of the last generated mapping and current branch head
mapped to the current branch’s code changes
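Put together, the three criteria amount to a set union. Here is a sketch under a few assumptions: test files are recognized by a `_test.rb` suffix, and `select_tests` with its keyword arguments is a hypothetical interface, not the real one.

```ruby
require "set"

# ASSUMPTION: test files follow a `_test.rb` naming convention.
def test_file?(path)
  path.end_with?("_test.rb")
end

# mapping:        source file => tests, from the last generated mapping
# branch_changes: files added or modified on the current branch
# drift_changes:  files added or modified between the commit the mapping
#                 was generated from and the current branch head
def select_tests(mapping:, branch_changes:, drift_changes:)
  selected = Set.new
  selected.merge(branch_changes.select { |f| test_file?(f) })       # criterion 1
  selected.merge(drift_changes.select { |f| test_file?(f) })        # criterion 2
  branch_changes.each { |f| selected.merge(mapping.fetch(f, [])) }  # criterion 3
  selected.to_a.sort
end

puts select_tests(
  mapping: { "app/models/shop.rb" => ["test/models/shop_test.rb"] },
  branch_changes: ["app/models/shop.rb", "test/models/new_test.rb"],
  drift_changes: ["test/models/other_test.rb", "app/models/order.rb"]
)
```

The second criterion is what covers the lag: a test that landed after the mapping was generated is not in the mapping yet, so it gets selected directly.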

3. There Are Untraceable Files

There are non-Ruby files such as YAML, JSON, etc. that can’t be traced on the call graph. We added custom patches to Rails to trace some of them. For example, we patched the I18n::Backend class to trace the translation files in YAML. For changes to files that haven’t been traced, we run every single test.
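The general shape of such a patch can be sketched with `Module#prepend`. This does not reproduce the actual I18n patch: `TranslationLoader` is a made-up stand-in for a class that reads non-Ruby files, and `LoadTracer` is a hypothetical name.

```ruby
require "set"

# Made-up stand-in for a loader that reads non-Ruby files such as YAML translations.
class TranslationLoader
  def load_file(path)
    "contents of #{path}" # a real loader would read and parse the file here
  end
end

# Records every path passed to load_file, then defers to the original method.
module LoadTracer
  def self.traced_paths
    @traced_paths ||= Set.new
  end

  def load_file(path)
    LoadTracer.traced_paths << path
    super
  end
end

TranslationLoader.prepend(LoadTracer)

TranslationLoader.new.load_file("config/locales/en.yml")
puts LoadTracer.traced_paths.to_a
```

The recorded paths can then be added to the files attributed to whichever test was running at the time.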

4. Metaprogramming Obfuscates Call Paths

We placed existing metaprogramming in a known directory and added glob rules on the file paths to determine which tests to run. We discourage new metaprogramming using Sorbet and other linters.
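A glob rule of this kind might look like the sketch below. The directory layout, `GLOB_RULES`, and `tests_for_glob_rules` are invented for illustration. The matching uses Ruby's `File.fnmatch?` with `File::FNM_PATHNAME` so that `**/` spans directories.

```ruby
# Hypothetical rules: glob on a changed file's path => glob of tests to run.
GLOB_RULES = {
  "app/metaprogramming/billing/**/*.rb" => "test/billing/**/*_test.rb",
}.freeze

# Returns the tests matched by every rule whose source glob matches a changed file.
def tests_for_glob_rules(changed_files, all_tests, rules = GLOB_RULES)
  flags = File::FNM_PATHNAME
  triggered = rules.select do |source_glob, _test_glob|
    changed_files.any? { |file| File.fnmatch?(source_glob, file, flags) }
  end
  all_tests.select do |test|
    triggered.values.any? { |test_glob| File.fnmatch?(test_glob, test, flags) }
  end
end

ALL_TESTS = [
  "test/billing/plans/plan_test.rb",
  "test/models/shop_test.rb",
].freeze

puts tests_for_glob_rules(["app/metaprogramming/billing/plans/dynamic_fields.rb"], ALL_TESTS)
```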

5. Some Failing Tests Won’t Be Caught

The generated mapping from dynamic analysis can get out of date with the latest main, so sometimes a failing test won’t get selected and the change that breaks it gets merged to main. To work around the issue, we run the full test suite on every deploy and automatically disable any failing test, so other developers won’t be blocked from shipping their code. The full test suite runs asynchronously, so pull requests can get merged before it completes.

Automatic disabling of failing tests sounds counterintuitive to many people. From what we observed, the percentage of pull requests with failing tests being merged to the main branch is about 0.06%. We also have other mechanisms to mitigate the risks, such as canary deploys and type checking using Sorbet. The code owners of the failing tests are notified. We expect developers to fix or remove the failures without blocking future deploys.

How Was the Dynamic Analysis Rolled Out?

In the experimentation phase of the new dynamic analysis system, the test-selection pipeline ran in parallel with the pipeline that runs the full test suite for each new PR. We measured the recall of the new test-selection pipeline: out of all the failing tests, we checked whether the new pipeline selected the same failing tests. We didn’t care about the tests that pass because it’s only the failing tests that cause trouble.
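For a single PR, that comparison boils down to a set intersection. A minimal sketch, with `failure_recall` as a hypothetical helper name:

```ruby
require "set"

# Share of the tests that failed in the full run which the
# selection pipeline also chose to run.
def failure_recall(failing_in_full_run, selected_tests)
  failing = failing_in_full_run.to_set
  return 1.0 if failing.empty? # nothing failed, so nothing could be missed

  (failing & selected_tests.to_set).size.to_f / failing.size
end

# Three of the four failing tests were selected.
puts failure_recall(%w[a_test b_test c_test d_test], %w[a_test b_test c_test x_test])
```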

We measured our results using three metrics.

Failure Recall

We define recall as the percentage of legitimately failing tests, excluding intermittently failing tests, that our system selected. We want this to be as close as possible to 100%. It’s hard to measure this accurately because of the occurrence of intermittently failing tests. Instead, we approximate the recall by looking at the number of consistently failing tests merged into main.

In the two months that the project has been active, out of the 8,360 commits that were merged, we failed to detect only five failing tests that landed on main. We also managed to resolve most of the root causes of those missed failures, so the same problems don’t repeat in the future.

We achieved a 99.94% recall.
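The figure follows from the two numbers above, treating each missed failure as one miss out of the merged commits. That is the approximation described earlier, not a per-test recall. The method name is just for illustration.

```ruby
# Approximate recall: share of merged commits that did NOT land a missed failing test.
def approximate_recall_percent(merged_commits, missed_failures)
  (1 - missed_failures.to_f / merged_commits) * 100
end

puts approximate_recall_percent(8_360, 5).round(2) # => 99.94
```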

Speed Improvement

The overall selection rate, the ratio of selected tests over total number of tests, is about 60%:

[Figure: Percentage of selected test files per build]

About 40% of builds run fewer than 20% of tests. This shows that many developers will significantly benefit from our test selection system:

[Figure: Percentage of builds that selected fewer than 20% of all tests]
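Both numbers can be computed from per-build counts. The sketch below uses invented builds, and it assumes the overall rate is aggregated over all builds rather than averaged per build. `Build` and the method names are hypothetical.

```ruby
# Hypothetical per-build record: how many tests were selected out of the total.
Build = Struct.new(:selected, :total)

def selection_rate(build)
  build.selected.to_f / build.total
end

# ASSUMPTION: the overall rate is total selected tests over total tests across builds.
def overall_selection_rate(builds)
  builds.sum(&:selected).to_f / builds.sum(&:total)
end

# Share of builds whose own selection rate is below the threshold.
def share_of_builds_below(builds, threshold)
  builds.count { |build| selection_rate(build) < threshold }.to_f / builds.size
end

# Made-up data, not our measurements.
SAMPLE_BUILDS = [Build.new(10, 100), Build.new(100, 100), Build.new(70, 100)].freeze

puts overall_selection_rate(SAMPLE_BUILDS)
puts share_of_builds_below(SAMPLE_BUILDS, 0.2)
```

The sample shows how a fairly high overall rate can coexist with individual builds that run only a small slice of the suite.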

Compute Time

In total, we’re saving about 25% of compute time. We measure this by adding up the time spent preparing and running tests on every Docker container in a build, and averaging that across all builds. The savings weren’t larger because a significant chunk of compute time is still used for setting up containers and databases and for pulling caches. Note that we’re also adding compute time by running the dynamic analysis for every deployed commit on main. We estimate that this will undo some, but not all, of the infrastructure cost savings.
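The measurement described above can be sketched as follows. The input shape (one array of per-container seconds per build) and both method names are assumptions, and the numbers are invented.

```ruby
# builds: one array per build, holding the seconds each container
# spent preparing and running tests (hypothetical input shape).
def average_compute_seconds(builds)
  builds.sum { |container_seconds| container_seconds.sum }.to_f / builds.size
end

# Percentage saved when average compute time drops from `before` to `after`.
def savings_percent(before, after)
  (before - after).to_f / before * 100
end

# Invented numbers, not our measurements.
before = average_compute_seconds([[200, 200], [150, 250]])
after  = average_compute_seconds([[150, 150], [100, 200]])
puts savings_percent(before, after)
```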

Other Approaches We Explored

Prior to choosing dynamic analysis, we explored other approaches but ultimately ruled them out.

Static Analysis

To determine a dependency graph, we briefly explored using Sorbet, but this would only be possible if the entire code base were converted to strict Sorbet typing. Unfortunately, the code base is only partially strictly typed, and it’s too big for my team to convert the rest.

Machine Learning

It’s possible to use machine learning to find the dependency graph. Facebook has an excellent research paper [PDF] on it. We chose dynamic analysis at Shopify because we weren’t sure whether we had enough data to make the prediction, and we wanted an approach that’s deterministic and reproducible.

More Machines for Tests

We tried adding more machines for the test suite. Unfortunately, the performance didn’t increase linearly as we scaled horizontally. In fact, tests on average take longer as we increase the number of machines past a certain number. Adding machines also doesn’t reduce intermittently failing tests, and it increases the possibility of failing connections to sidecars, thus increasing test retries.

Benefits of Running Fewer Tests

There are three major benefits of selectively running tests:

Developers get faster feedback from tests.
The likelihood of encountering intermittently failing tests decreases, which increases the speed of developers further.
CI costs less.

Skepticism about Dynamic Analysis/Test Selection

Before the feature was rolled out, many developers were skeptical that it would work. Frankly, I was one of them. Many people voiced their doubts both privately and openly. However, much to my surprise, after it went live, people were silent. On average, developers request running the full test suites on under 2% of the pull requests.

If you’re in a similar situation, I hope our experience helps you. It’s hard for developers to embrace the idea that some tests won’t be run when the importance of tests is ingrained in our heads.

If this sounds like the kind of problems you'd enjoy solving, come work for us. Check out the Software Development at Shopify (Expression of Interest) career posting and apply specifying an interest in Developer Acceleration.

Spark Joy by Running Fewer Tests