Resiliency Planning for High-Traffic Events

On January 27, 2021, Shipit!, our monthly event series, presented Building a Culture of Resiliency at Shopify. Learn about creating and maintaining resiliency plans for large development teams, testing and tooling, developing incident strategies, and incorporating and improving feedback loops. The video is now available.

Each year, Black Friday Cyber Monday weekend represents the peak of activity for Shopify. Not only is this the most traffic we see all year, but it’s also the time our merchants put the most trust in our team. Winning this weekend each year requires preparation, and that preparation starts as soon as the previous weekend ends.

Load Testing & Stress Testing: How Does the System React?

When preparing for a high-traffic event, regular load testing is key. We’ve already discussed some of the tools we use, but I want to explain how we use these exercises to build towards a more resilient system.

While we use these tests to confirm that we can sustain required loads or probe for new system limits, we can also use regular testing to find potential regressions. By executing the same experiments on a regular basis, we can spot any trends at easily handled traffic levels that might spiral into an outage at higher peaks.

This same tooling allows us to run similar loads against differently configured shops and look for differences caused by theme, configuration, or any other dimension we might want to compare.
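
Shopify’s internal load testing tools aren’t shown in this post, but as a rough sketch of the idea, an open-source framework like Locust can express the same kind of recurring storefront test. The shop hosts, endpoints, and traffic weights below are hypothetical and only illustrate the shape of such a test.

```python
# Illustrative sketch only: not Shopify's internal load testing tooling.
# Locust (https://locust.io) drives simulated shoppers against a storefront;
# the paths and traffic weights below are made up for the example.
from locust import HttpUser, task, between


class StorefrontShopper(HttpUser):
    # Pause 1-3 seconds between requests to roughly approximate human browsing.
    wait_time = between(1, 3)

    @task(3)
    def browse_collection(self):
        # Weighted 3x: most storefront traffic is browsing, not buying.
        self.client.get("/collections/all")

    @task(1)
    def view_product(self):
        self.client.get("/products/example-product")
```

Running the same file on a schedule against each shop configuration (for example, `locust -f storefront.py --host https://shop-a.example.com --headless -u 200 -r 20 -t 10m`, with the host swapped per shop) produces latency and error-rate trends that can be compared across runs and across configurations.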

Resiliency Matrix: What are Our Failure Modes?

If you've read How Complex Systems Fail, you know that “Complex systems are heavily and successfully defended against failure” and “Catastrophe requires multiple failures - single point failures are not enough.” For those defences to hold, we need to understand our dependencies, their failure modes, and how those failures impact the end-user experience.

We ask teams to construct a user-centric resiliency matrix, documenting the expected user experience under various scenarios. For example:

A user-centric resiliency matrix documenting the potential failures and their impact on the user experience: for example, whether a user can still browse (yes) or check out (no) when MySQL is down.

The act of writing this matrix serves as a very basic tabletop chaos exercise. It forces teams to consider how well they understand their dependencies and what the expected behaviors are.

This exercise also provides a visual representation of the interactions between dependencies and their failure modes. Looking across rows and columns reveals areas where the system is most fragile. This provides the starting point for planning work to be done. In the above example, this matrix should start to trigger discussion around the ‘User can check out’ experience and what can be done to make this more resilient to a single dependency going ‘down’.
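
One lightweight way to keep a matrix like this close to the system it describes is to record it as structured data. The sketch below is purely illustrative: the dependencies, experiences, and outcomes are invented rather than Shopify’s actual matrix, but the shape makes the most fragile experiences easy to query.

```python
# Hypothetical resiliency matrix expressed as data instead of a wiki table.
# Keys are dependencies; values map user experiences to the expected state
# ("up", "degraded", "down") when that dependency fails. Entries are invented.
RESILIENCY_MATRIX = {
    "mysql": {"browse storefront": "degraded", "add to cart": "down", "check out": "down"},
    "redis": {"browse storefront": "up", "add to cart": "degraded", "check out": "degraded"},
    "cdn":   {"browse storefront": "degraded", "add to cart": "up", "check out": "up"},
}


def fragile_experiences(matrix):
    """Return experiences that go fully down when any single dependency fails."""
    return sorted({
        experience
        for outcomes in matrix.values()
        for experience, state in outcomes.items()
        if state == "down"
    })


if __name__ == "__main__":
    # ['add to cart', 'check out'] in this invented example: candidates for
    # the kind of resiliency work the matrix discussion should trigger.
    print(fragile_experiences(RESILIENCY_MATRIX))
```

Any experience that goes fully down behind a single dependency is exactly the single point of failure the matrix exercise is meant to surface.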

Game Days: Do Our Models Match?

So, we’ve written our resilience matrix. This is a representation of our mental model of the system, and when written, it's probably a pretty accurate representation. However, systems change and adapt over time, and this model can begin to diverge from reality.

This divergence often goes unnoticed until something goes wrong and you’re stuck in the middle of a production incident asking “Why?”. Running a game day exercise allows us to test the documented model against reality and adjust it in a controlled setting.

The plan for a game day derives from the resiliency matrix. For the matrix above, we might formulate a plan like this:

A game day plan derived from the matrix, laying out the scenarios to be tested and how they will be accomplished.

Here, we are laying out what scenarios are to be tested, how those will be accomplished, and what we expect to happen. 
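
As a sketch of what such a plan might look like when written down, each scenario can pair the failure to inject, the method of injection, and the observations we expect. The injection methods, alert names, and on-call rotations below are hypothetical; they only illustrate the shape of a plan, not Shopify’s actual game day procedures.

```python
# Hypothetical game day plan derived from a resiliency matrix. Each scenario
# records what we break, how we break it, and what we expect to observe, so
# the exercise has testable predictions before anything is injected.
GAME_DAY_SCENARIOS = [
    {
        "scenario": "MySQL primary unavailable",
        "method": "block traffic to the primary at the proxy layer",  # illustrative
        "expected_user_impact": {"browse storefront": "degraded", "check out": "down"},
        "expected_alerts": ["mysql-primary-unreachable"],  # invented alert name
        "expected_page": "datastores-on-call",  # invented rotation name
    },
    {
        "scenario": "Cache flushed",
        "method": "flush the storefront cache in the test tier",
        "expected_user_impact": {"browse storefront": "up", "check out": "degraded"},
        "expected_alerts": ["cache-hit-rate-low"],
        "expected_page": None,  # we expect no one to be paged for this scenario
    },
]
```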

We’re not only concerned with external effects (what works, what doesn’t), but internally do any expected alerts fire, are the appropriate on-call teams paged, and do those folks have the information available to understand what is happening?

If we refer back to How Complex Systems Fail, the defences against failure are technical, human, and organizational. On a good game day, we’re attempting to exercise all of these:

  • Do any automated systems engage?
  • Do the human operators have the knowledge, information and tools necessary to intervene?
  • Do the processes and procedures developed help or hinder responding to the outage scenario?

By tracking the actual observed behavior, we can then update the matrix as needed or make changes to the system in order to bring our mental model and reality back into alignment.
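
Closing that loop can be as simple as diffing each scenario’s predicted impact against what was actually observed during the exercise. Below is a minimal sketch, assuming the same invented up/degraded/down vocabulary used above.

```python
def model_divergences(expected_impact, observed_impact):
    """Return (experience, expected, observed) tuples where the mental model
    and the observed behavior disagree. Each divergence is either a matrix
    update or a gap in the system (or its alerting) that needs fixing."""
    return [
        (experience, expected, observed_impact.get(experience, "unknown"))
        for experience, expected in expected_impact.items()
        if observed_impact.get(experience, "unknown") != expected
    ]


# Invented example: the matrix predicted checkout would be fully down during
# the MySQL scenario, but in practice it only degraded.
expected = {"browse storefront": "degraded", "check out": "down"}
observed = {"browse storefront": "degraded", "check out": "degraded"}
print(model_divergences(expected, observed))
# [('check out', 'down', 'degraded')] -> update the matrix or investigate why.
```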

Incident Analysis: How Do We Get Better?

Over the course of the year, incidents happen that disrupt service in some capacity. While the primary focus is always on restoring service as fast as possible, each incident also serves as a learning opportunity.

This article is not about why or how to run a post-incident review; there are more than enough well-written pieces by folks who are experts on the subject. But to refer back to How Complex Systems Fail, one of the core tenets in how we learn from incidents is “Post-accident attribution to a ‘root cause’ is fundamentally wrong.”

When we focus on a single root cause, we stop at easy, shallow actions that resolve the ‘obvious’ problem. Doing so ignores the deeper technical, organizational, and cultural issues that contributed to the incident and will contribute again if left uncorrected.

What’s Special About BFCM?

We’ve talked about the things we’re constantly doing, year-round to ensure we’re building for reliability and resiliency and creating an anti-fragile system that gets better after every disruption. So what do we do that’s special for the big weekend?

We’ve already mentioned How Complex Systems Fail several times, but to go back to that well once more, “Change introduces new forms of failure.” As we get closer to Black Friday, we slow down the rate of change.

This doesn’t mean we’re sitting on our hands and hoping for the best, but rather we start to shift where we’re investing our time. Fewer new services and features as we get closer, and more time spent dealing with issues of performance, reliability, and scale.

We carefully review the resiliency matrices we’ve defined, run game days and load tests more frequently, and work through any issues or bottlenecks those reveal. This means updating runbooks, refining internal tools, and shipping fixes for the issues this activity brings to light.

All of this comes together to provide a robust, reliable platform to power over $5.1 billion in sales.

Shipit! Presents: Building a Culture of Resiliency at Shopify

Watch Ryan talk about how we build a culture of resiliency at Shopify to ensure a robust, reliable platform powering over $5.1 billion in sales.


Ryan is a Senior Development Manager at Shopify. He currently leads the Resiliency team, a centralized globally distributed SRE team responsible for keeping commerce better for everyone.
