What do you do with a finite amount of time to deal with an infinite number of things that can go wrong?
We prepare for Black Friday Cyber Monday (BFCM) and other high-traffic events during the year to make sure the Shopify platform stays up so our merchants can sell to their buyers. To do this, we built an infrastructure platform at a large scale that is highly complex, interconnected, globally distributed, requiring thoughtful technology investments from a network of teams. We’re changing how the internet works, where no single person can oversee the full design and detail at our scale.
This year over BFCM, we served 75.98M requests per minute to our commerce platform at peak. That’s 1.27M requests per second. Working at this massive scale in a complex and interdependent system, it would be impossible to identify and mitigate every possible risk. This post breaks down a high-level risk mitigation process into four questions that can be applied to nearly any scenario in order to help you make the best use of your time and resources available.
1. What are the risks?
To inform mitigation decisions, you need to first understand the current state of affairs. We expand our breadth of knowledge by learning from people from all corners of the platform. We run “what could go wrong” (WCGW) exercises where anyone building or interested in infrastructure can highlight a risk. These can be technology risks, operational risks, or something else. Having this unfiltered list is a great way to get a broad understanding of what could happen.
The goal here is visibility.
2. What is worth mitigating?
Great brainstorming leaves us with a large and daunting list of risks. With limited time to fix things, the key is to prioritize what is most important to our business. To do this, we vote on risks, then gather technical experts to discuss highest ranked risks in more detail, including their likelihood and severity. We make decisions about what and how to mitigate, and which team will own each action item.
The goal here is optimizing how we spend our time.
3. Who makes what decisions?
In any organization, there are times when waiting for a perfect consensus is not possible or not effective. Shopify moves tremendously fast because we make sure to identify decision makers, then empower them to gather input, weigh risks/rewards, and come to a decision. Often the decision is best made by the subject matter expert or who bears the most benefit or repercussions of whatever direction we choose.
The goal here is aligning incentives and accountability.
4. How do you communicate?
We move fast, but still need to keep stakeholders and close collaborators informed. We summarize key findings and risks from our WCGW exercises so that we all land on the same page about our risk profile. This may include key risks or single points of failure. We over-communicate so that we’re aligned, aware, and stakeholders have opportunities to interject.
The goal here is alignment and awareness.
Solving the Right Things When There Is Uncertainty
Underlying all these questions is the uncertainty in our working environment. You never have all the facts or know exactly which components will fail when and how. The best way to deal with uncertainty is using probability.
Expert poker players know that great bets don’t always yield great outcomes and bad bets don’t always yield bad outcomes. What’s important is to bet on the probability of outcomes, where over enough rounds, your results will converge to expectation. The same applies in engineering, where we constantly make bets and learn from them. Great bets require clearly distinguishing the quality of your decisions versus outcomes. It means not over-indexing on bad decisions that led to lucky outcomes or great decisions that happen to run into very unlucky scenarios.
Knowing that we can’t control everything also helps us stay calm, which is vital for us to practice good judgment in high-pressure situations.
When it comes to BFCM (and really, life in general), no one can predict the future or fully protect against all risks. The question is, what would you change looking back? In hindsight, would you feel confident that you prioritized the most important things and made thoughtful bets using the information available to you? Did you facilitate meaningful discussions with the right people? Could you justify your actions to your customers and their customers?
Kathryn Tang leads Shopify's Engineering Operations team. She's been at Shopify for over 5 years, working with a multitude of R&D and commercial teams to derive business insights and guide operating decisions to help us scale.
We all get shit done, ship fast, and learn. We operate on low process and high trust, and trade on impact. You have to care deeply about what you’re doing, and commit to continuously developing your craft, to keep pace here. If you’re seeking hypergrowth, can solve complex problems, and can thrive on change (and a bit of chaos), you’ve found the right place. Visit our Engineering career page to find your role.