Using Betas to Deploy New Features Safely

In the software industry, changing a running system is dangerous. As the saying goes, “If it ain’t broke, don’t fix it.” Unfortunately, even if code were perfect, progress marches ever forward, and new features are continually being added. One of the worst feelings for a developer is learning that something you just shipped has caused an issue, or even worse, an incident.

At companies like Shopify that practice continuous deployment, code changes multiple times every day. We have to de-risk new features to ship safely and confidently without impacting the million+ merchants using our platform.

Beta flags are one approach to feature development that gives us a number of notable advantages, among them:

  • Reduce the blast radius of changes that don’t go as planned by rolling out the beta to a percentage of the subject group.
  • Instantaneously choose when the change is active in production. Without beta flags, changes would be active at the time of deployment.
  • Instantaneously roll back the feature.
  • Ship new code paths that are typically inactive while still allowing devs to test them in production.

Anatomy of a Beta Flag

The word “beta” is a bit overloaded in the software industry, being used even to refer to launched products. We’ll define some primitives for clarity.

Subject: A concept that you want to define a control plane against. For multi-tenant SaaS applications, this is typically the model corresponding to your tenant. For Shopify, this is typically our concept of a shop. While designing, consider a polymorphic approach so you can implement betas against multiple types of things.

BetaIdentifier: This is often a simple string that represents the feature you are developing. For example, “multi_location.” Keep in mind that if you use a string instead of an auto-incrementing integer, you should be wary of case-sensitivity and of accidentally re-using this same string in the future. Metadata can be associated with this and should be considered for internal documentation/tooling purposes. For example, a high-level description of the feature, a list of chat channels requiring notification about this feature’s rollout, a list of owners, descriptions of the behavior when this feature is enabled or disabled, etc.

BetaFlag: At the lowest level, this is a small piece of data associated with a Subject. This can be implemented as a “Subject has_many BetaFlag” relationship. Inside the BetaFlag, we have a BetaIdentifier and typically some created_at/updated_at timestamps.

For a given BetaFlag, we can:

  • Check whether an instance of Subject has this BetaIdentifier. If so, the feature is turned on for the Subject.
  • Grab a list of all Subjects with this BetaIdentifier.

This check allows us explicit per-Subject feature toggling.
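The two operations above can be sketched with a small in-memory store. This is only an illustration: `BetaFlagStore` and its method names are hypothetical, and a real application would more likely back this with the “Subject has_many BetaFlag” database relationship described earlier.

```ruby
require "set"

# Hypothetical in-memory stand-in for BetaFlag records.
# A real implementation would persist these rows in a database.
class BetaFlagStore
  def initialize
    # beta_name => Set of subject ids that carry the flag explicitly
    @subject_ids_by_beta = Hash.new { |hash, beta_name| hash[beta_name] = Set.new }
  end

  # Explicitly turn the feature on for one subject.
  def apply(beta_name, subject_id)
    @subject_ids_by_beta[beta_name] << subject_id
  end

  # Explicitly turn the feature back off for one subject.
  def remove(beta_name, subject_id)
    @subject_ids_by_beta[beta_name].delete(subject_id)
  end

  # Does this subject have the flag?
  def enabled?(beta_name, subject_id)
    @subject_ids_by_beta[beta_name].include?(subject_id)
  end

  # All subjects that have the flag.
  def subject_ids_with(beta_name)
    @subject_ids_by_beta[beta_name].to_a
  end
end
```

With this sketch, `store.apply("multi_location", shop_id)` would turn the feature on for a single shop, and `store.subject_ids_with("multi_location")` would list every shop carrying the flag.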

BetaRollout: This is data that lives independently of any particular Subject. Inside BetaRollout, we have:

  • beta_name: a BetaIdentifier, so we can know which concept we’re dealing with.
  • percentage_rollout: an integer (0..100) reflecting what percentage of Subjects you wish to enable the feature for.
  • A method to calculate whether to consider a Subject as “rolled out”.

For a given BetaRollout, we can ask, “Does this instance of Subject fall within the rolled-out percentage for this BetaIdentifier?” If so, the feature is turned on for the Subject.

If the BetaRollout record does not exist, we will assume that the feature is turned off.

Here’s an example of a performant way to implement the BetaRollout#enabled? method:
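A minimal sketch of such a method, assuming a plain-Ruby `BetaRollout` class rather than a database-backed model. The choice of MD5, the `bucket_for` helper, and the `"name:id"` key format are assumptions made for illustration, not necessarily what Shopify uses.

```ruby
require "digest"

# Hypothetical plain-Ruby stand-in for a BetaRollout record.
class BetaRollout
  attr_reader :beta_name, :percentage_rollout

  def initialize(beta_name:, percentage_rollout:)
    @beta_name = beta_name
    @percentage_rollout = percentage_rollout # integer in 0..100
  end

  # Maps a subject to a stable bucket in 0..99. The bucket depends only on
  # the beta name and the subject id, never on the current percentage.
  def bucket_for(subject_id)
    Digest::MD5.hexdigest("#{beta_name}:#{subject_id}").to_i(16) % 100
  end

  # A subject is "rolled out" when its bucket falls below the percentage.
  def enabled?(subject_id)
    bucket_for(subject_id) < percentage_rollout
  end
end
```

The check needs no per-Subject record and no lookup beyond the rollout itself: it is a single digest and a comparison.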

By calculating a digest of the two identifiers (the BetaIdentifier and the Subject’s identifier) and converting it into an integer modulo 100, we place each Subject in a bucket that differs from one beta to the next. This means that every beta rollout we do affects a different subset of Subjects. Why does this matter? It prevents the same subset of Subjects from consistently bearing the negative effects of every rollout.

This implementation also has a nice invariant: as the percentage increases (for example, from 0% to 11%, then 11% to 20%, etc.), the previous set of Subjects that saw the feature is still seeing the feature. The digest modulo 100 remains static even as percentage_rollout changes. This is key to ensuring a nice user experience because it would be frustrating to experience a feature appearing and disappearing (seemingly randomly) as the rollout percentage increases.

The ultimate effect is that every BetaRollout has a uniquely consistent growing set of Subjects experiencing the feature in the journey from 0% to 100% (or from X% to 0%).

Bundling These Concepts

Once we define a BetaIdentifier, we can do the following things:

  • Apply the new feature to a particular Subject by creating and associating a new record of BetaFlag to the Subject.
  • Roll out the feature to X% of Subjects by creating a BetaRollout and setting it to a particular % value.

These two separate concepts are related to the BetaIdentifier. The question about whether or not a new feature is enabled then looks something like this for a given Subject:
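A sketch of that combined check, using in-memory stand-ins. The `Subject` and `BetaRollout` structs, the `beta_flag?` method, and the `ROLLOUTS` hash are hypothetical names introduced for illustration, and MD5 is an assumed digest.

```ruby
require "digest"

# Hypothetical in-memory stand-ins for the records described above.
Subject = Struct.new(:id, :beta_flags) do
  # Explicit check: is a BetaFlag with this identifier attached to the subject?
  def beta_flag?(beta_name)
    beta_flags.include?(beta_name)
  end
end

BetaRollout = Struct.new(:beta_name, :percentage_rollout) do
  # Implicit check: does the subject's stable bucket fall below the percentage?
  def enabled?(subject)
    bucket = Digest::MD5.hexdigest("#{beta_name}:#{subject.id}").to_i(16) % 100
    bucket < percentage_rollout
  end
end

module Beta
  ROLLOUTS = {} # beta_name => BetaRollout; stands in for a table lookup

  def self.enabled?(beta_name, subject)
    return true if subject.beta_flag?(beta_name)

    rollout = ROLLOUTS[beta_name]
    # A missing BetaRollout record means the feature is off.
    rollout ? rollout.enabled?(subject) : false
  end
end
```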

In English, this equates to “Does the subject have the flag explicitly (via BetaFlag) or implicitly (via BetaRollout)?” It’s helpful to define this in some fashion, so we’ll call this unified view the Beta.

What happens when we want to revert the beta flag? Perhaps the feature is buggy. There are two cases to consider:

  • The flag was rolled out using the % mechanism only.
  • The flag was rolled out by manually applying the flag to specific instances of the Subject.

When the flag is rolled out using the % mechanism only, the rollback process is extremely straightforward: change the % rollout value of the BetaRollout to `0`. Note that anyone with the explicit application of the flag will still see the feature.

When the flag is rolled out to potentially thousands of Subjects by direct application, we find ourselves in a more challenging situation. We have an ongoing incident and thousands of records in the DB that we can’t quickly remove. In the best case, we have to write a maintenance task to alter the database for these thousands of Subjects. In the absolute worst case, it’s just not possible!

It would’ve been tempting to refer to the Beta concept directly throughout our code because it seems to give us all the flexibility we want. However, we’ve just discovered a case where we can’t easily roll back. How do we proceed?

Taking Things a Step Further

Instead of having all of our code refer to a Beta directly, we should be writing an abstraction a layer higher. In our hypothetical situation, we can imagine something called a “Feature” that is defined like:
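One possible shape for such a Feature, sketched under assumptions: the `Beta` module here is a simplified in-memory stand-in, the `Subject` struct and `ROLLOUTS` hash are hypothetical, and the flag names `my-cool-new-feature` and `my-cool-new-feature-opt-out` are placeholders.

```ruby
require "digest"

# Hypothetical subject carrying its explicitly applied flags.
Subject = Struct.new(:id, :beta_flags)

# Simplified in-memory stand-in for the Beta primitive.
module Beta
  ROLLOUTS = Hash.new(0) # beta_name => percentage (0..100); missing means 0

  def self.enabled?(beta_name, subject)
    return true if subject.beta_flags.include?(beta_name)

    bucket = Digest::MD5.hexdigest("#{beta_name}:#{subject.id}").to_i(16) % 100
    bucket < ROLLOUTS[beta_name]
  end
end

module Feature
  # The opt-out beta always wins, so it can act as a per-subject escape hatch
  # or, when rolled out to 100%, as a global kill switch.
  def self.my_cool_new_feature?(subject)
    return false if Beta.enabled?("my-cool-new-feature-opt-out", subject)

    Beta.enabled?("my-cool-new-feature", subject)
  end
end
```

Calling code asks only `Feature.my_cool_new_feature?(subject)` and never touches the underlying betas, which keeps the opt-out logic in one place.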

With another layer above the primitives like this, we’ve just introduced flexibility for developers to:

  • Apply a feature directly to a specific Subject (or thousands of them).
    For example: Adding a feature to a production Subject for testing or for a particularly unique rollout that cannot be %-based.
  • Apply this feature to a random %-based sampling of all other Subjects.
    For example: A typical rollout of a feature might encompass slowly rolling out to an ever-increasing population of Subjects.
  • Apply an escape hatch for specific Subjects that encounter problems with the feature by applying the my-cool-new-feature-opt-out flag directly to the Subject.
    For example: Some bugs encountered might not be severe enough to roll back the rollout for all Subjects, but we want to allow a specific Subject to disable the feature.
  • Apply a kill switch for all Subjects by rolling out the my-cool-new-feature-opt-out flag to 100%.
    For example: A Beta has been applied to thousands of Subjects and can’t easily be removed, but we need to halt the feature immediately.

If we imagine a runaway train of an incident caused by rolling out your feature, we see that we should hopefully be able to instantly resolve it by applying the kill switch! Power, safety, speed.

Often, beta features start with some simple qualifications. Invariably, they evolve to become something more like:
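A sketch of what that evolution might look like. The specific qualifications (a supported-country list and a plan check) and the `country_code` and `plan` attributes are invented purely for illustration, and `Beta` is again a simplified in-memory stand-in.

```ruby
require "digest"

# Hypothetical subject; country_code and plan are illustrative attributes.
Subject = Struct.new(:id, :country_code, :plan, :beta_flags)

# Simplified in-memory stand-in for the Beta primitive.
module Beta
  ROLLOUTS = Hash.new(0) # beta_name => percentage (0..100); missing means 0

  def self.enabled?(beta_name, subject)
    return true if subject.beta_flags.include?(beta_name)

    bucket = Digest::MD5.hexdigest("#{beta_name}:#{subject.id}").to_i(16) % 100
    bucket < ROLLOUTS[beta_name]
  end
end

module Feature
  SUPPORTED_COUNTRIES = %w[CA US].freeze # assumption for illustration

  def self.my_cool_new_feature?(subject)
    # Escape hatch and kill switch come first.
    return false if Beta.enabled?("my-cool-new-feature-opt-out", subject)
    # Business qualifications that accumulated over time.
    return false unless SUPPORTED_COUNTRIES.include?(subject.country_code)
    return false if subject.plan == "trial"

    Beta.enabled?("my-cool-new-feature", subject)
  end
end
```

Because callers only ever ask `Feature.my_cool_new_feature?(subject)`, each new qualification is a change to one method.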

Using a higher-level abstraction makes room for the feature to easily change qualification as the business demands change. In our experience, even things that seem like simple Yes|No Beta flags often evolve to have other requirements. This is all to say that you should consider avoiding referring to the lower level primitives directly. It’s easier to change one method than dozens of interspersed Beta.enabled?(...) invocations. This generalizes nicely to exposure over an API: If the feature evolves, but N clients are still referring directly to the primitives, we can’t necessarily update all of them simultaneously without the higher-level abstraction being exposed.

Some Things to Keep in Mind

This is just one path to developing new features safely and efficiently that we’ve found to be highly effective at Shopify. The formulations we developed here are informative, and your implementation will surely differ.

Data Structure Differences and Reconciliation

If code path A generates data incompatible with code path B, you will likely have problems when you roll back unless you have considered the rollback experience up front. Always consider what happens when you switch paths.

This beta pattern optimizes the ability to change code paths in production quickly, but the lingering data can often be forgotten.

Not all Features Can be Developed Sanely in this Manner

Similar to the previous point, some features have impacts on your underlying data model. Even if we considered migrating between the two code paths, you might encounter a feature where rolling back leaves things in a worse or nonsensical state. Some things simply don’t make sense to have the ability to be rolled back. Some features are better developed iteratively rather than switched on all at once.

Not all Features Can be Rolled Back Without a User Disruption

Consider a new feature that users have come to depend on: it may be impossible to roll back without leaving them confused. Imagine introducing a new business concept that wasn’t fully fleshed out. You want to prevent future users from using/seeing it, but you still need to accommodate the current set of users that have already used it and come to rely upon it.

Suppose you were using Features to abstract the underlying primitives. In that case, you could allow a subset of users to continue using the feature while preventing further access for all users that have not yet used the feature. You could accomplish this by manually applying the BetaFlag to a set of Subjects while rolling back the BetaRollout down to 0%.

Avoid Reusing BetaIdentifier Names

Suppose you’re using a magic string as a handle. If you reuse a previously used handle, you may already have BetaFlag and BetaRollout records for it in your database. If these were left enabled without proper deprecation (or deletion from the database), then as soon as you use the same handle again, the flag is immediately active on some Subjects, which is not what you expect. In practice, this is extremely rare.
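One way to guard against accidental reuse (and against the case-sensitivity pitfall mentioned earlier) is to register every identifier in one place and reject duplicates. The `BetaIdentifierRegistry` class below is a hypothetical sketch, not a description of Shopify’s tooling. It assumes retired identifiers stay registered so they can never be handed out again.

```ruby
# Hypothetical registry of every BetaIdentifier ever used.
class BetaIdentifierRegistry
  class DuplicateIdentifier < StandardError; end

  def initialize
    @descriptions = {} # normalized name => description metadata
  end

  # Registers a new identifier, raising if the name (ignoring case and
  # surrounding whitespace) has been used before.
  def register(name, description:)
    key = normalize(name)
    raise DuplicateIdentifier, "#{key} has already been used" if @descriptions.key?(key)

    @descriptions[key] = description
    key
  end

  def registered?(name)
    @descriptions.key?(normalize(name))
  end

  private

  # Treat "Multi_Location" and "multi_location" as the same handle.
  def normalize(name)
    name.to_s.strip.downcase
  end
end
```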

Treat Beta Rollouts as if They Were Deploys

While you’ve already shipped the underlying code for the beta, typically, it lies inactive until we apply the flags as described above. One of the most impactful things we’ve done at Shopify is to consider %-based rollouts as being as important as deploys.

When someone changes a %-based rollout at Shopify, our #operations chat channel is informed of the change and why it was changed. If exceptions start to occur or our metrics start to decline, we now have another data point to consider. Previously, we were operating in the dark in the face of beta rollouts. Instantly changing an application’s running code paths sounds an awful lot like a “deploy.”

When someone applies a beta flag to a Subject directly at Shopify, the teams that have developed the feature are informed with a Slack message, thanks to the metadata we associated with the BetaIdentifier described above. This can help to prevent errant beta applications and inform the teams about “opt-out” beta applications. The teams developing the feature can better steward it armed with the knowledge of what is changing.

Realize the Testing Nuances

When you’ve added your new feature, you’ve also added unit tests for the new code paths, probably even some integration tests. These tests all look roughly similar: first, enable the feature, then test the new code path. All is well. Or is it?

Depending on how deeply the new feature touches your system, unit tests may be enough. However, you must consider that your entire suite of tests is testing the old code path with the exception of these brand-new tests you added. If the feature is set to 100% (and sits there for months), then almost all of your tests are testing a code path that doesn’t truly occur in production any longer. We’ve seen this present when removing beta flags after rollouts: suddenly, hundreds of tests fail because the old tests didn’t test the new code path. Typically, it’s a minor inconvenience, and some extra effort must be spent auditing and adjusting the tests. In our experience, this typically hasn’t been a problem for most small features.

Ultimately, each feature set adds another permutation to the set of all possible code paths our code could run. We wouldn’t necessarily run the entire test suite twice (once with beta off, once with beta on) because this would be a combinatorial explosion and generally a waste of CI time. For some particularly hairy features, we’ve opted to run whole test files (for example, specific controllers, not the entire suite) twice for an extra degree of confidence.

As a practical note to the dev working on the feature, it can be illuminating to hardcode the feature to “true” and witness which tests fail on branch CI, potentially pointing to missed considerations and edge cases within the feature’s implementation.

Clean up Your Work

It can be common to see code rolled out to 100%, but the beta flags still exist months or even years later. If the previous code path is still acceptable, it can be a good practice to keep the beta flags around for a couple of months in case something comes up. However, given enough time, teams will cycle off the project, and this pre-beta code path now becomes dead code. Ultimately, this is tech debt that needs to be cleaned up.

Conclusion

These primitives and patterns have allowed Shopify to develop a large variety of changes: extensive new features, small and large refactors, and the toggling of performance improvements, to name a few. More so, armed with these primitives, we have the confidence to ship boldly, knowing that we have mechanisms to control the software even after it has been deployed. The power that this level of control gives you can’t be overstated.

However, they’re not a panacea. Our implementations of these concepts are hundreds of lines long and consider various things, such as caching strategies to load these records for our Subjects efficiently. Hopefully, these concepts provide you with a useful starting point to empower your developers and make your software more robust while defining what “beta” means for you.

Anthony Cameron has been a Full Stack Staff Developer at Shopify for over 7 years and a member of numerous teams. Anthony's current team helps merchants get orders to buyers as quickly as possible while doing less work. If you’d like to learn more about Anthony, check him out on Twitter.

