Scaling a Majestic Monolith
Shopify’s primary application is a monolithic Rails application that powers our cloud-based, multi-channel commerce platform for 600,000+ merchants in over 175 countries. As the number of developers working on the app has grown, our tooling has grown with them. At Shopify, we mostly follow a trunk-based development workflow, and every week more developers write more code, open more pull requests, and merge more commits to master. Occasionally, master merges go wrong. For example, two unrelated merges can affect one another, a merge can introduce a new flaky test, or work in progress can be merged by accident. Even a low percentage of a growing number of merges will eventually produce too many failures to ignore, so we needed to improve our tooling around merging pull requests.
Shipit is our open source deployment coordination tool. It’s our source of truth for what is deployed, what’s being deployed (if anything), and what’s about to be deployed. There are times we don’t want any more commits merged to master (e.g. if CI on master is failing; if there’s an ongoing incident and we don’t want any more changes introduced to the master branch; or if the batch size of undeployed commits is too high), and Shipit is also the source of truth for this. Originally, we expected developers to check the status of master by hand before merging. This quickly became unsustainable, so Shipit has a browser extension that tells the developer the status of the stack right on their pull request.
If, for some reason, it’s unsafe to merge, the developer is asked to hold off.
Developers then had to manually revisit their pull request to see whether it was safe to merge. A large batch of undeployed commits is also considered unsafe for more merges, a condition Shipit calls ‘backlogged’.
A rapidly growing development team brings scaling challenges (and lots of frustration): whenever a stack returned to a mergeable state, developers rushed to get their changes merged before the pipeline became backlogged again. As we continued to grow, this became more and more disruptive, and so the merge queue idea was born.
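The conditions described above (failing CI on master, an ongoing incident, or too many undeployed commits) can be sketched as a single status check. This is a minimal illustration, not Shipit’s actual code: the names `StackState`, `MergeStatus`, and `merge_status`, and the default backlog threshold of 10 commits, are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class MergeStatus(Enum):
    SUCCESS = "success"        # safe to merge
    LOCKED = "locked"          # CI is failing or an incident is in progress
    BACKLOGGED = "backlogged"  # too many commits are waiting to deploy


@dataclass
class StackState:
    # Hypothetical snapshot of a stack; field names are illustrative only.
    ci_passing: bool
    incident_in_progress: bool
    undeployed_commits: int


def merge_status(state: StackState, max_undeployed: int = 10) -> MergeStatus:
    """Decide whether new merges to master should be allowed right now."""
    if not state.ci_passing or state.incident_in_progress:
        return MergeStatus.LOCKED
    if state.undeployed_commits > max_undeployed:
        return MergeStatus.BACKLOGGED
    return MergeStatus.SUCCESS
```

A browser extension like the one described could poll a check of this kind and render the result next to the merge button.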
The Merge Queue
Shipit was the obvious candidate to house this new automation: it’s the source of truth for the state of master and deploys, and it is already integrated with GitHub. We added the ability to enqueue pull requests for merge directly within Shipit (you can see how it’s configured here in the Shipit GitHub repo). Once a pull request is queued and the state of master is OK, it is merged very quickly. We didn’t want our developers to have to leave GitHub to enqueue pull requests, so we turned to the browser extension to solve that problem!
If a stack has the merge queue enabled, we inject an ‘Add to merge queue’ button. Integrating the button with the normal development flow was important for developer adoption. During testing, we discovered that people still merged directly to master for routine merges, and interviews revealed that they instinctively ‘pressed the big green button to merge’. We wanted the merge queue to become the default mechanism for merges, so we tweaked our extension to de-emphasise the default ‘Merge pull request’ button by turning it gray, and we saw a further boost in adoption.
By bringing the merge event into the regular deploy pipeline, we’re able to codify some other things we consider best practices. For example, the merge queue can be configured to reject a pull request if it has diverged from its merge base beyond configurable thresholds. Short-lived branches are very important for trunk-based development, so old branches (in terms of both age and number of commits diverged) represent an increased risk and need to be discouraged. The merge queue is configured inside shipit.yml, so the discussions that inform these decisions are all traceable back to a pull request!
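To make the divergence check concrete, here is a rough sketch of how a queue might reject stale branches by both commit count and age. It is not Shipit’s implementation: the `PullRequest` fields, the function name, and the default thresholds (50 commits, 72 hours) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

TOO_FAR_BEHIND = "too many commits behind master"
TOO_OLD = "branch diverged from master too long ago"


@dataclass
class PullRequest:
    # Hypothetical fields describing how far the branch has drifted.
    commits_behind_base: int   # commits added to master since the merge base
    merge_base_date: datetime  # when the branch diverged from master


def rejection_reason(
    pr: PullRequest,
    now: datetime,
    max_commits: int = 50,
    max_age: timedelta = timedelta(hours=72),
) -> Optional[str]:
    """Return why the pull request is rejected, or None if it may be queued."""
    if pr.commits_behind_base > max_commits:
        return TOO_FAR_BEHIND
    if now - pr.merge_base_date > max_age:
        return TOO_OLD
    return None
```

Returning a reason rather than a plain boolean lets the queue tell the developer what to fix, which in both cases is usually to rebase onto a recent master.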
It’s important to stress that the merge queue is highly encouraged, but not enforced. At Shopify, we trust our developers to override the automation, if they feel it’s required, and merge directly to master.
After launching the merge queue, we quickly learned that the queue wasn’t always behaving as developers expected. We had configured the queue to require certain CI statuses before merging, and if a pull request wasn’t ready, Shipit would eject it from the queue, making the developer re-enqueue the pull request later. There are some common situations where this caused frustration. For example, after a code review, some small tweaks are made to satisfy reviewers, and the pull request is ready to merge pending CI passing. The developer wants to queue the pull request for merging and move on to their next task, but instead has to keep monitoring CI. The same thing happened with minor changes (readme updates and the like). Developers would save a lot of time if they could queue-and-forget, so that’s what we built! If CI is pending on a queued pull request, Shipit will wait for CI to pass or fail, and merge or reject as appropriate.
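The queue-and-forget behaviour boils down to a small decision rule. The sketch below is only an illustration of that rule; the `CIState` and `Action` enums and the `next_action` function are hypothetical names, and treating a locked master as "keep waiting" is an assumption based on the description above.

```python
from enum import Enum


class CIState(Enum):
    PENDING = "pending"
    SUCCESS = "success"
    FAILURE = "failure"


class Action(Enum):
    WAIT = "wait"      # keep the pull request in the queue
    MERGE = "merge"    # merge it now
    REJECT = "reject"  # eject it from the queue and notify the developer


def next_action(ci_state: CIState, master_mergeable: bool) -> Action:
    """Decide what to do with one queued pull request."""
    if ci_state is CIState.FAILURE:
        return Action.REJECT
    if ci_state is CIState.PENDING:
        # Queue-and-forget: pending CI no longer ejects the pull request.
        return Action.WAIT
    # CI passed: merge only while master is accepting merges.
    return Action.MERGE if master_mergeable else Action.WAIT
```

Before the change, the pending case would have been a rejection, which is what forced developers to come back and re-enqueue.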
We received a lot of positive feedback for that small adjustment, and for the merge queue in general. By getting automation involved earlier in the pipeline, we’re able to take some of the load off our developers and make them happier and more productive. Over 90% of pull requests to Shopify’s core application are merged using Shipit’s merge queue! That makes Shipit the largest contributor to our monolith.
Unsafe Commits
A passing, regularly exercised CI pipeline gives you high confidence that a given changeset won’t cause any negative impacts once it reaches production. Ultimately, the only way to see the impact of your changes is to ship them, and sometimes that results in a breaking change reaching your users. You quickly roll back the deploy, stop the shipping pipeline, and investigate what caused the break. Once you identify the bad commit, it can be reverted on master, and the pipeline can resume, right? Consider this batch of commits on master, waiting to deploy:
- Good commit A
- Bad commit B
- Good commit C
- Good commit D
- Revert of commit B
How does your deployment tool know that deploying commit C or D is unsafe? Up until recently, we relied on our developers to manage this situation by hand, manually deploying the first safe commit before unlocking the pipeline. We’d rather our developers focus on adding value elsewhere, so we decided to have Shipit manage this automatically where possible. Our solution comes in two parts:
Marking Commits as Unsafe for Deploy
If a commit is marked as unsafe, Shipit will not deploy that ref in isolation. In the above example, with commits B, C, and D marked as unsafe, the oldest commit (A) might be deployed first, followed by the remaining commits together, ending with the revert. This is the functionality we want, but it still requires manual configuration, so we complement it with automatic revert detection.
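One way to picture "not deployed in isolation" is that a deploy may only stop at a commit that is not marked unsafe. The sketch below groups undeployed commits accordingly. It is an illustration under assumptions, not Shipit’s code: the `Commit` class and `deploy_batches` function are hypothetical, and it assumes the pipeline advances to each safe commit in turn, whereas a real deploy may cover a larger batch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Commit:
    # Hypothetical record for an undeployed commit.
    sha: str
    unsafe: bool = False


def deploy_batches(undeployed: List[Commit]) -> List[List[str]]:
    """Group commits (oldest first) so that every batch ends on a safe commit."""
    batches: List[List[str]] = []
    current: List[str] = []
    for commit in undeployed:
        current.append(commit.sha)
        if not commit.unsafe:
            # A safe commit is a valid place to stop a deploy.
            batches.append(current)
            current = []
    # Any trailing unsafe commits wait until a safe commit lands on top.
    return batches
```

With A safe, B, C, and D unsafe, and the revert safe, this yields one batch containing only A and a second batch containing B, C, D, and the revert.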
Automatic Revert Detection
If Shipit detects a revert of an undeployed commit, it will mark the original commit (and any intermediate commits between it and the revert) as unsafe for deploy.
This removes the need for any manual intervention when doing git revert, as Shipit can continue to deploy automatically and safely.
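Revert detection can be sketched as a scan over the undeployed commits. The example below relies on the line "This reverts commit <sha>." that git revert writes into the commit message by default; whether Shipit detects reverts this way is not stated here, so treat the matching strategy, the `Commit` class, and `mark_reverted_ranges` as assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import List

# Default body line produced by `git revert`.
REVERT_PATTERN = re.compile(r"This reverts commit ([0-9a-f]{7,40})")


@dataclass
class Commit:
    # Hypothetical record for an undeployed commit.
    sha: str
    message: str
    unsafe: bool = False


def mark_reverted_ranges(undeployed: List[Commit]) -> None:
    """Mark each reverted commit, and everything up to its revert, as unsafe.

    `undeployed` is ordered oldest first and is modified in place.
    """
    for index, commit in enumerate(undeployed):
        match = REVERT_PATTERN.search(commit.message)
        if match is None:
            continue
        reverted_sha = match.group(1)
        for start in range(index):
            if undeployed[start].sha.startswith(reverted_sha):
                # The original commit and all intermediates are unsafe;
                # the revert itself remains a safe place to deploy.
                for unsafe_commit in undeployed[start:index]:
                    unsafe_commit.unsafe = True
                break
```

If the reverted commit has already been deployed, it is not in the undeployed list, so nothing is marked and the revert ships like any other commit.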
In Conclusion
These new Shipit features allow us to ship faster and more safely, and they hand valuable time back to our developers. Shipit is open source, so you can benefit from these features yourself: check out the setup guide to get started. We’re actively exploring open sourcing the browser extension mentioned above, so stay tuned for more updates on that!