Test Budget: Time Constrained CI Feedback

At Shopify we run more than 170,000 tests in our core monolith. Naturally, we're constantly exploring ways to make this faster, and the Test Infrastructure team analyzed the feasibility of introducing a test budget: a fixed amount of time for tests to run. The goal is to speed up the continuous integration (CI) test running phase by accepting more risk. To achieve that goal we used prioritization to reorder the test execution plan in order to increase the probability of a fast failure. Our analysis provided insights into the effectiveness of executing prioritized tests under a time constraint. The single most important finding was that we were able to find failures after we had run only 70% of the test-selection suite.

The Challenge

Shopify’s codebase relies on CI to avoid regressions before releasing new features. As the code submission rate grows along with the development team size, so does the size of the test pool and the time between code check-ins and test result feedback. As seen in the figure below, developers will occasionally get late CI feedback, while at other times the CI builds complete in under 10 minutes. This irregular cadence of receiving CI feedback leads to more frequent context switches.

Various techniques exist to speed up CI such as running tests in parallel or reducing the number of tests to run with test selection. Balancing the cost of running tests against the value of running them is a fundamental topic in test selection. Furthermore, if we think of the value as a variable then we can make the following observations for executing tests:

No amount of tests can give us complete confidence that no production issue will occur.
The risk of production issues is lower if we run all the tests.
As complexity of the system increases, the value of testing any individual component decreases.
Not all tests increase our confidence level the same way.

The Approach

It’s important to note first the difference between test selection and test prioritization. Test selection deterministically selects all tests that correspond to the given changes using a call graph. Test prioritization, on the other hand, orders the tests with the goal of discovering failures fast. That sorted set won’t always be the same for the same change, since the prioritization techniques use historical data.

The system we built produces a prioritized set of tests on top of test selection and constrains the execution of those tests using a predetermined time budget. Having established that there’s a limited time to execute the tests, the next step is to determine what’s the best time to stop executing tests and enforce it.

The time constraint, or budget (which is where the name Test Budget comes from), is the predetermined time at which we terminate test execution, with the aim of finding as many failures as possible within that period.
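As a rough illustration of the idea, the sketch below runs tests in priority order and stops once the budget is spent. It is a minimal sketch, not our CI implementation: the function name `run_with_budget`, the `(name, callable)` test representation, and the injectable `clock` parameter are all assumptions made for the example, and it runs tests serially rather than in parallel.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_with_budget(
    prioritized_tests: Iterable[Tuple[str, Callable[[], bool]]],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str], bool]:
    """Run tests in priority order until the time budget is spent.

    Each test is a hypothetical (name, callable) pair whose callable
    returns True on pass and False on failure.
    Returns (executed names, failed names, budget_exhausted).
    """
    deadline = clock() + budget_seconds
    executed: List[str] = []
    failed: List[str] = []
    for name, test in prioritized_tests:
        # The budget is checked between tests only; a test that is
        # already running is allowed to finish.
        if clock() >= deadline:
            return executed, failed, True
        executed.append(name)
        if not test():
            failed.append(name)
    return executed, failed, False
```

Because the deadline is only checked between tests, the last test started may overrun the budget slightly. A stricter variant would need to interrupt the running test.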

System Overview

The guiding principle we used to build the Test Budget was: we can't be sure there will be no bugs in production that affect the users after running our test suite in any configuration.

To identify the most valuable tests to run within an established time budget, the following steps must be performed:

identify prioritization criteria and compute the respective prioritized sets of tests
compute the metrics for all criteria and analyze the results to determine the best criteria
further analyze the data to pick a time constraint for running the tests

The image below gives a structural overview of the test prioritization system we built. First, we compute the prioritized sets of tests using historical test results for every prioritization criterion (for example, the failure rate criterion will have its own prioritized set of tests). Then, given a commit and the test-selection set that corresponds to that commit, we execute the prioritized tests as a CI build. These prioritized tests are a subset of the test-selection test suite.

First, the system obtains the test result data needed by the prioritization techniques. The data is ingested into a Rails app that’s responsible for the processing and persistence. It exposes the test results through an HTTP API and a GUI. For persistence, we chose to use Redis, not only because of the unstructured nature of our data, but also because of the Redis Sorted Sets data structure that enables us to query for ordered sets of tests in O(log n) time, where n is the number of elements in the set.

The goal of the next step is to select a subset of tests given the changes in the committed code. We created a pipeline that’s triggered for a percentage of the builds that contain failures. We execute this pipeline with a specific prioritization each time and calculate metrics based on it.

Modeling Risk

During the CI phase, the risk of not finding a fault can be thought of as a numbers game. How certain are we that the application will be released successfully if we have tested all the flows? What if we test the same flows 1000 times? We leaned on test prioritization to order the tests so that faults are found as early as possible, which encouraged the use of heuristics as the prioritization criteria. This section explores how to measure the risk of not detecting faults under the time budget when the tests we skip aren’t chosen randomly but are the ones the best heuristics rank lowest.

Prioritization Criteria

We built six test prioritization criteria that produced a rating for every test in the codebase:

failure_rate: how frequently a test fails based on historical data.
avg_duration: how fast a test executes. Executing faster tests allows us to execute more tests in a short amount of time.
churn: a file that’s changing too much could be more brittle.
coverage: how much of the source code is executed when running a test.
complexity: based on the lines of code per file.
default: this is the random order set.
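To make the idea of a rating concrete, here is a minimal sketch of how a failure_rate rating could be derived from historical results and used to order a test-selection set. The record shape `(test_name, passed)` and the function names are assumptions for illustration, as is the choice to treat tests without history as having a rating of 0.0.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def failure_rates(history: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Compute failures / runs per test from (test_name, passed) records."""
    runs: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for name, passed in history:
        runs[name] += 1
        if not passed:
            failures[name] += 1
    return {name: failures[name] / runs[name] for name in runs}


def prioritize(selected: Iterable[str], rating: Dict[str, float]) -> List[str]:
    """Order the selected tests by descending rating.

    Tests with no rating are treated as 0.0 (an assumption of this
    sketch); ties are broken by name so the order is deterministic.
    """
    return sorted(selected, key=lambda name: (-rating.get(name, 0.0), name))
```

For example, with a history where `c` always failed, `a` failed half the time, and `b` never failed, `prioritize(["a", "b", "c"], rates)` places `c` first, then `a`, then `b`.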

Evaluation Criteria

After we get the prioritized tests, we need to evaluate the results of executing the test suite following the prioritized order. We chose two groups of metrics to evaluate the criteria:

The first includes the Time to First Failure (TTFF) which acts as a tripwire since if the time to first failure is 10 minutes then we can’t enforce a lower time constraint than 10 minutes.
The second group of metrics includes the Average Percentage of Faults Detected (APFD) and the Convergence Index. We needed to start thinking of the test execution timing problem using a risk scale, which would open the way for us to run fewer tests by tweaking how much risk we will accept.

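The APFD is a measure of how early a particular test suite execution detects failures. APFD is calculated using the following formula:

$$\mathrm{APFD} = 1 - \frac{F_1 + F_2 + \cdots + F_m}{n \cdot m} + \frac{1}{2n}$$

The equation tells us that to calculate the APFD we sum the positions of the tests that expose each failure, divide that sum by the product of the number of tests and the number of failures, subtract the result from 1, and add a small correction of 1/(2n). In the equation above: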

n is the number of test cases in the test suite
m is the total number of failures in the test suite
F_i is the position in the prioritized order set of the first test that exposes the fault i.

The APFD values range from 0 to 1, where higher APFD values imply a better prioritization.

For example, for the test suites (produced by different prioritization algorithms) T1 and T2 that each have a total number of tests (n) = 100 and total number of faults (m) = 4, we get the following matrix:

Fault	T1	T2
F1	1	4
F2	10	20
F3	30	60
F4	60	61

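And we calculate their APFD values:

$$\mathrm{APFD}(T1) = 1 - \frac{1 + 10 + 30 + 60}{100 \cdot 4} + \frac{1}{2 \cdot 100} = 0.7525$$

$$\mathrm{APFD}(T2) = 1 - \frac{4 + 20 + 60 + 61}{100 \cdot 4} + \frac{1}{2 \cdot 100} = 0.6425$$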

The first prioritization has a better APFD rating (0.7525 versus 0.6425).
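The same calculation can be sketched in a few lines of Python. This assumes the standard APFD definition, 1 - (F_1 + ... + F_m) / (n * m) + 1 / (2n), which reproduces the two values quoted above; the function and parameter names are illustrative.

```python
from typing import Sequence


def apfd(first_failing_positions: Sequence[int], total_tests: int) -> float:
    """Average Percentage of Faults Detected.

    first_failing_positions: for each fault, the 1-based position in the
    prioritized order of the first test that exposes it (F_i).
    total_tests: number of test cases in the suite (n).
    """
    m = len(first_failing_positions)
    if m == 0 or total_tests <= 0:
        raise ValueError("APFD needs at least one fault and one test")
    n = total_tests
    return 1 - sum(first_failing_positions) / (n * m) + 1 / (2 * n)


t1 = apfd([1, 10, 30, 60], total_tests=100)  # positions from column T1
t2 = apfd([4, 20, 60, 61], total_tests=100)  # positions from column T2
print(round(t1, 4), round(t2, 4))
```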

The Convergence Index tells us when to stop testing within a time constrained environment because a high convergence indicates we’re running fewer tests and finding a big percentage of failures.

$$\text{Convergence Index} = \frac{\text{Percentage of faults detected}}{\text{Percentage of tests executed}}$$

The formula to calculate the Convergence Index is the percentage of faults detected divided by the percentage of tests executed.
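As a small worked sketch of that ratio, the function below computes the index from raw counts. The function name and the count-based parameters are assumptions for the example.

```python
def convergence_index(
    faults_detected: int,
    total_faults: int,
    tests_executed: int,
    total_tests: int,
) -> float:
    """Percentage of faults detected divided by percentage of tests executed."""
    if total_faults <= 0 or total_tests <= 0 or tests_executed <= 0:
        raise ValueError("counts must be positive")
    pct_faults = 100 * faults_detected / total_faults
    pct_tests = 100 * tests_executed / total_tests
    return pct_faults / pct_tests
```

For instance, detecting 4 of 5 faults (80%) after executing 60 of 100 tests (60%) gives an index of 80 / 60, roughly 1.33. A value above 1 means the prioritized order finds faults faster than the share of tests executed would suggest.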

Data Analysis

For each build, we created and instrumented a prioritized pipeline to produce artifacts for building the prioritization sets and emit test results to Kafka topics.

The prioritization pipeline in Buildkite

We ran the prioritized pipeline multiple times to apply statistical analysis to our results. Finally, we used Python notebooks to combine all the measurements and easily visualize the percentiles. For APFD and TTFF we decided to use boxplots to visualize possible outliers and skewness of the data.

When Do We Find the First Failing Test?

We used the TTFF metric to quantify how fast we could know that the CI will eventually fail. Finding the failure within a time window is critical because the goal is to enforce that window and stop the test execution when the time window ends.

In the figure above we present the statistical distributions for the prioritization criteria using boxplots. The median time to find a failure is less than five minutes for all the criteria. Complexity, churn, and avg_duration have the worst third quartile results with a maximum of 16 minutes. On the other hand, default and failure_rate gave more promising results with a median of less than three minutes.
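For reference, TTFF for a single run can be sketched as the elapsed time at which the first failing test finishes. The sketch below assumes serial execution and a hypothetical `(duration_seconds, passed)` record per test in execution order; a parallel CI run would need per-worker timing instead.

```python
from typing import Iterable, Optional, Tuple


def time_to_first_failure(
    results: Iterable[Tuple[float, bool]],
) -> Optional[float]:
    """Return the elapsed seconds when the first failing test finishes.

    results: (duration_seconds, passed) pairs in execution order.
    Returns None when no test fails.
    """
    elapsed = 0.0
    for duration, passed in results:
        elapsed += duration
        if not passed:
            return elapsed
    return None
```

With durations of 30, 45, and 20 seconds where only the third test fails, the TTFF is 95 seconds, so a time budget shorter than that would miss the failure.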

Which Prioritization Criteria Have the Best Failure Detection Rates?

We used the APFD metric to compare the prioritization criteria. A higher APFD value indicates a better failure detection rate.

The figure above presents the boxplots of APFD values for all the prioritization criteria. We notice that there isn’t a significant difference between the churn and complexity prioritization criteria. Both have median values close to zero, which makes them poorly suited for prioritizing the tests. We also see that failure_rate has the best detection rate, marginally better than the random (default) one.

Which Prioritization Criterion Has the Quickest Convergence Time?

The rate at which we detect new test failures decreases as we execute more tests. We visualized this with the convergence index data using a step chart. In all the convergence graphs the step is 10% of the test suite executed.

The above figure indicates that, in the mean case, all the criteria find some percentage of faults after running only 50% of the test suite, but the default and failure_rate prioritization criteria stand out.

For the mean case, executing 50% of the test suite finds 50% of the failures using the default prioritization and 60% using the failure_rate. The failure_rate criterion is able to detect 80% of the failures after running only 60% of the test suite.

How Much Can We Shrink the Test Suite Given a Time Constraint?

The p20 and p5 visualizations of the convergence quantify how reliably we could detect faults within the time budget. Because a higher convergence value is better, we look at low percentiles: the p20 value is met or exceeded by 80% of builds, and the p5 value by 95% of builds. The time budget is an upper bound: the CI system executes tests only up to that time bound.

For example, after looking at the p20 (80% of builds) plot (the above figure), we need to execute 60% of the test-selection tests (the test-selection suite is 40% of the whole test suite on median) to detect an acceptable amount of failures. Then, the time budget is the time it takes to execute 60% of the selected tests.

Looking at the plot of the 5th percentile (95% of the builds; see the figure above), we notice that we could execute 70% of the test suite already reduced by test selection to detect 50% of the failures.
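The reading of the percentile plots described above can be sketched as a small search: for each step of the suite executed, take a low percentile of the detection rate across builds and pick the first step that meets a target. This is a minimal sketch under stated assumptions: the nearest-rank percentile method, the function names, and the shape of `curves` are illustrative choices, and the numbers in the usage example are made up, not our measurements.

```python
import math
from typing import Optional, Sequence


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def smallest_sufficient_share(
    curves: Sequence[Sequence[float]],
    p: float,
    target_pct_faults: float,
    step_pct: int = 10,
) -> Optional[int]:
    """Find the smallest share of tests that meets the detection target.

    curves[b][k] is the percentage of failures detected in build b after
    running (k + 1) * step_pct percent of the prioritized tests.
    Returns that share as a percentage, or None if the p-th percentile
    never reaches the target.
    """
    for k in range(len(curves[0])):
        if percentile([curve[k] for curve in curves], p) >= target_pct_faults:
            return (k + 1) * step_pct
    return None


# Made-up curves for five builds, sampled at 25% steps.
example_curves = [
    [50, 100, 100, 100],
    [0, 50, 100, 100],
    [100, 100, 100, 100],
    [0, 0, 50, 100],
    [50, 50, 100, 100],
]
share = smallest_sufficient_share(example_curves, p=20, target_pct_faults=50, step_pct=25)
```

The time budget would then be the time it takes to execute that share of the selected tests.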

The Future of Test Budget Prioritization

Looking at our convergence and TTFF results, if we want to emphasize the discovery of a faulty commit (that is, the first failure), we can see that we could execute less than 70% of the test-selection suite.

The results of the data analysis suggest several alternatives for future work. First, deep learning models could utilize the time budget as a constraint while they are building the prioritized sets. Prioritizing tests using a feedback mechanism could be the next prioritization to explore, where tests that never run could be automatically deleted from the codebase, or failures that result in problems during production testing could be given a higher priority.

Finally, one potential application of a Test Budget prioritization system lies outside the scope of the Continuous Integration environment: the development environment. Another way of looking at the ordered sets is that the first tests are more impactful or more susceptible to failures. We could use such data to inform developers during the development phase that parts of the codebase are more likely to have failing tests in CI. A message such as “this part of the codebase is covered by a high priority test which breaks in 1% of the builds” would give feedback to developers immediately while they’re writing the code. It would shift testing to the left by giving code suggestions during development, and eventually reduce the costs and time of executing tests in the CI environment.
