Black Friday and Cyber Monday are the biggest days of the year at Shopify by every metric. As the Infrastructure team started preparing for the upcoming seasonal traffic in the late summer of 2014, we were confident that we could cope with the traffic, and determined resiliency to be the top priority. A resilient system is one that keeps functioning when one or more of its components is unavailable or unacceptably slow. Applications that aren't carefully monitored quickly become intertwined with their external services, turning minor dependencies into single points of failure.
For example, the only part of Shopify that relies on the session store is user sign-in - if the session store is unavailable, customers can still purchase products as guests. Any other behaviour would be an unfortunate coupling of components. This post is an overview of the tools and techniques we used to make Shopify more resilient in preparation for the holiday season.
Testing for resiliency with Toxiproxy
We started by filling out a matrix of service dependencies. Rows denote external services and columns denote sections of the application; each cell records the state of that section of the application when the given external service is degraded. This gave us an overview of the work that had to be done. We built separate matrices for dependent services being either somewhat degraded or completely unavailable.
Filling this out accurately was not necessarily simple. Simulating a service being down in development was not trivial, and simulating a slow service was even harder. With tc(8) and iptables(8) we were able to fill out the matrix. However, these tools don't work well for automated integration tests because they require sudo privileges.
Having tests for the matrix was a must; otherwise we couldn't guarantee its state wouldn't degrade over time. Since the tools mentioned above require root access, we investigated proxies that could simulate latency and downtime at the TCP level, but didn't find one that suited our needs: we wanted an online API for editing proxies and support for deterministic latencies, so that it could be driven from integration tests. As one of the first steps to expand our resiliency toolkit, we wrote Toxiproxy, a TCP proxy for simulating network conditions that developers can run locally for easy testing.
When running Shopify locally, we can tamper with connections directly via the Toxiproxy HTTP API from the browser:
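As a rough illustration of the kind of call involved (shown here from Ruby rather than the browser console, assuming a recent Toxiproxy version and a hypothetical proxy named shopify_sessions in front of the session store; Toxiproxy's HTTP API listens on port 8474 by default):

```ruby
require "net/http"
require "json"
require "uri"

# Add 1000ms of latency to everything the (hypothetical) shopify_sessions
# proxy returns to the application.
uri = URI("http://127.0.0.1:8474/proxies/shopify_sessions/toxics")
Net::HTTP.post(
  uri,
  { type: "latency", stream: "downstream", attributes: { latency: 1000 } }.to_json,
  "Content-Type" => "application/json"
)
```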
We write all our resiliency integration tests with Toxiproxy:
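A sketch of what such a test can look like with the Toxiproxy Ruby client; the proxy name and route are illustrative, not Shopify's actual names:

```ruby
test "product page still renders when the cache is slow" do
  # Add one second of latency to every response from the cache for the
  # duration of the block.
  Toxiproxy[:redis_cache].downstream(:latency, latency: 1000).apply do
    get "/products/some-product"
    assert_response :success
  end
end
```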
Fallbacks for unavailable resources
Let's take the scenario from before, where the session store is down, and make it resilient. In Shopify's case, a customer visiting a storefront is not required to be signed in. Being signed in is nice to have, but checkouts can still happen as a guest.
Imagine a method that runs before every page load to fetch the customer for the current session (if one exists):
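A simplified, hypothetical version of such a filter; SessionStore, Customer and the cookie name are illustrative:

```ruby
# Runs before every storefront request to look up the signed-in customer.
def fetch_customer_for_session
  session = SessionStore.retrieve(cookies[:session_token])
  @customer = Customer.find(session[:customer_id]) if session
end
```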
There are two data stores involved here: the primary data store with the customer data, and the data store where the session is stored. In this case we are aiming for resiliency to the session store being down, not the primary data store with customer data. With Toxiproxy we can easily write a failing test for our assumption:
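A sketch of such a test, assuming a Toxiproxy proxy named :session_store fronts the session store in the test environment, and an illustrative storefront route:

```ruby
test "storefront renders for guests when the session store is down" do
  Toxiproxy[:session_store].down do
    get "/products/some-product"
    assert_response :success
  end
end
```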
Then we can make our method resilient. With this our test will pass:
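Continuing the hypothetical example from above, rescuing the session store's connection error lets the request proceed as a guest:

```ruby
def fetch_customer_for_session
  session = SessionStore.retrieve(cookies[:session_token])
  @customer = Customer.find(session[:customer_id]) if session
rescue SessionStore::ConnectionError
  # The session store is unreachable; fall back to browsing as a guest.
  @customer = nil
end
```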
In a large code base, the rescue pattern doesn't scale well. We don't want to pollute the entire code base with rescue every time an external resource is accessed. At Shopify we use decorators around data stores to provide fallbacks. This spares developers from having to think about resiliency all the time, unless they're introducing new functionality that relies on an external resource, in which case they're expected to write a resilient wrapper for it.
Failing fast with circuit breakers
One of the problems that had been hanging over our heads was how to handle slow resources. In a distributed system, failing services are a luxury; slow and unresponsive ones are a nightmare. Timeouts across a system decrease the capacity of a cluster tremendously. If every single request has to hit a 200ms timeout before it raises an exception (which is either gracefully handled by a fallback, or fails fast to free up resources), the application's overall performance and capacity are significantly degraded. Timeouts are not enough at scale. Furthermore, constantly knocking on the door of a service that's rejecting connections is wasteful and can make the problem worse.
The circuit breaker pattern is based on a simple observation: if we hit a timeout for a given service one or more times, we're likely to keep hitting it for a while. Instead of hitting the timeout repeatedly, we can mark the resource as dead for a period of time, during which any call to it raises an exception instantly. We now use this pattern a fair bit in Shopify.
When we do a Remote Procedure Call (RPC), it will first check the circuit. If the circuit is rejecting requests because of too many failures reported by the driver, it will throw an exception immediately. Otherwise the circuit will call the driver. If the driver fails to get data back from the data store, it will notify the circuit. The circuit will count the error so that if too many errors have happened recently, it will start rejecting requests immediately instead of waiting for the driver to time out. The exception will then be thrown back to the original caller. If the driver’s request was successful, it will return the data back to the calling method and notify the circuit that it made a successful call.
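To make that flow concrete, here is a deliberately minimal, single-process circuit breaker sketch; Shopify's real implementation is shared across workers and handles probing a recovering resource more carefully:

```ruby
class CircuitBreaker
  class OpenCircuitError < StandardError; end

  def initialize(error_threshold: 3, error_timeout: 10)
    @error_threshold = error_threshold  # failures needed to open the circuit
    @error_timeout   = error_timeout    # seconds the circuit stays open
    @errors          = 0
    @opened_at       = nil
  end

  # Wrap a call to the underlying driver.
  def with_circuit
    raise OpenCircuitError, "circuit is open" if open?
    result = yield
    @errors = 0 # a successful call resets the error count
    result
  rescue OpenCircuitError
    raise
  rescue StandardError
    @errors += 1
    @opened_at = Time.now if @errors >= @error_threshold
    raise
  end

  private

  def open?
    return false unless @opened_at
    if Time.now - @opened_at < @error_timeout
      true
    else
      # The error timeout has passed: reset and let calls through again.
      @opened_at = nil
      @errors = 0
      false
    end
  end
end
```

Wrapping each driver call in something like breaker.with_circuit { driver.query(...) } then gives the fail-fast behaviour described above.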
Failing even faster with Semian
Imagine if the timeout for our data store isn't as low as 200ms, but is actually 10s. For example, you might have a relational data store where, for some customers, 10s queries are (unfortunately) legitimate. Reducing the time of worst-case queries requires a lot of effort, and we are working on it; simply dropping those queries could leave some customers unable to access certain functionality. High timeouts are especially costly in a non-threaded environment, where blocking IO means a worker is effectively doing nothing while it waits.
In this case, circuit breakers aren't sufficient. Assuming the circuit is shared across all processes on a server, it still takes at least 10s before the circuit opens, and in that time every worker can end up blocked. Worse, requests that start just before the circuit opens can block for up to another 10s, so we can be in a reduced-capacity state for as long as 20s. We thought of a number of potential solutions to this problem: stricter timeouts, grouping timeouts by section of our application, timeouts per statement. But they all still revolved around timeouts, and those are extremely hard to get right.
Instead of thinking about timeouts, we took inspiration from Netflix's Hystrix and the book Release It! (the resiliency bible), and looked at our services as connection pools. On a server with W workers, only a certain number of them are expected to be talking to a single data store at once. Let's say we've determined from our monitoring that there's a 10% chance a given worker is talking to mysql_shard_0 at any point in time under normal traffic. The probability that five workers are talking to it at the same time is 0.1^5 = 0.001%. If we only allow five workers to talk to that resource at any given point in time, and accept the 0.001% false positive rate, we can fail the sixth worker attempting to check out a connection instantly. This means that while the five workers are waiting for a timeout, all the other W - 5 workers on the node instantly fail to check out a connection and open their circuits, so our capacity is only degraded by a relatively small amount. We've built Semian to handle external resource acquisition across processes using SysV semaphores.
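A rough sketch of how Semian is used, following its README; the resource name and numbers are illustrative, and the exact options depend on the Semian version:

```ruby
require "semian"

# Allow at most five workers on this node to talk to mysql_shard_0 at once.
Semian.register(
  :mysql_shard_0,
  tickets: 5,           # size of the SysV-semaphore-backed "pool"
  timeout: 0.5,         # wait at most 0.5s for a ticket before failing
  error_threshold: 3,   # circuit breaker: errors before the circuit opens
  error_timeout: 10,    # seconds the circuit stays open
  success_threshold: 2  # successes needed to close it again
)

Semian[:mysql_shard_0].acquire do
  # Run the query against mysql_shard_0 here. If no ticket frees up within
  # the timeout, Semian raises immediately instead of letting this worker
  # block behind a slow data store.
end
```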
Summary
Combined with circuit breakers, Semian gives us a line of defense against misbehaving services that looks like this:
The RPC first checks the circuit; if the circuit is open, it raises an exception straight away, which triggers the fallback (the default fallback is a 500 response). Otherwise, it tries Semian, which fails instantly if too many workers are already querying the resource. Finally, the driver queries the data store. If the data store responds, the driver returns the data to the RPC. If the data store is slow or fails, the driver is our last line of defense against the misbehaving resource: it raises an exception after hitting its timeout or after an immediate failure, and those failures feed back into the circuit and Semian so that future calls fail even faster.
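A rough sketch of how the layers compose around a single query; circuit_for, mysql_client and storefront_fallback are hypothetical helpers, CircuitBreaker is the sketch from earlier, and this is not Shopify's actual wrapper:

```ruby
def query_shard_0(sql)
  circuit_for(:mysql_shard_0).with_circuit do   # 1. fail instantly if the circuit is open
    Semian[:mysql_shard_0].acquire do           # 2. fail instantly if no ticket is free
      mysql_client.query(sql)                   # 3. the driver enforces its own timeout
    end
  end
rescue CircuitBreaker::OpenCircuitError, Semian::TimeoutError, Mysql2::Error
  storefront_fallback                           # 4. fallback, e.g. a 500 or cached data
end
```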
We're pleased to have come such a long way with resiliency in 2014. Since adopting these techniques at Shopify, we have survived network partitions and removed several single points of failure. We've built a great toolkit consisting of Semian, circuit breakers, and Toxiproxy. As we continue to extract resiliency patterns such as circuit breakers from Shopify, they'll be open sourced under the Shopify organization on GitHub.