Making commerce better for everyone is a challenge we face on a daily basis. For our Production Engineering team, it means ensuring that our 600,000+ merchants have a reliable and scalable platform to support their business needs. We need to be able to support everything our merchants throw at us—including the influx of holiday traffic during Black Friday and Cyber Monday (BFCM). All of this needs to happen without an interruption in service. We’re proud to say that the effort we put in to deploying, scaling, and launching new projects on a daily basis gives our merchants access to a platform with 99.98% uptime.
Black Friday Cyber Monday 2018
To put the impact of this into perspective, Black Friday and Cyber Monday is what we refer to as our World Cup. Each year, our merchants push the boundaries of our platform to handle more traffic and more sales. This year alone, merchants sold over $1.5 billion USD in sales throughout the holiday weekend.
What people may not realize is that Shopify is made up of many different internal services and interaction points with third-party providers, like payment gateways and shipping carriers. The performance and reliability of each of this dependencies can potentially affect our merchants and buyers in different ways. That’s why our Production Engineering teams preparations for BFCM run the entire gamut.
To increase the chances of success on BFCM Production Engineering run “game days” on our systems and their dependencies. Game days are a form of fault injection where we test our assumptions about the system by degrading its dependencies under controlled conditions. For example, we’ll introduce artificial latency into the code paths that interact with shipping providers to ensure that the system continues working and doing something reasonable. That could be, for instance, falling back to another third party or hard-coded defaults if a third party dependency were to become slow for any reason, or verifying that a particular service responds as expected to a problem with their main datastore.
Besides fault injection work, Production Engineering also run load testing exercises where volumes similar to what we expect during BFCM are created synthetically and sent to the different applications to ensure that the system and its components behave well under the onslaught of requests they’ll serve on BFCM.
At Shopify, we pride ourselves on continuous and fast deploys to deliver features and fixes as fast as we can; however, the rate of change on a system increases the probability of issues that can affect our users. During the ramp-up period for BFCM, we manage the normal cadence of the company by establishing both a feature freeze and a code freeze. The feature freeze starts several weeks before BFCM and means no meaningful changes to user-facing features are deployed to prevent changes on merchant’s workflows. At that point in the year, changes, even improvements can have an unacceptable learning curve for merchants that are diligently getting ready for the big event.
A few days before BFCM and during the event an actual code freeze is in effect, means that only critical fixes can be deployed and everything else must remain in stasis. The idea is to reduce the possibility of introducing bugs and unexpected system interactions that could cause the service to be compromised during the peak days of the holiday season.
Did all of our preparations work out? With BFCM in the rearview mirror, we can say, yes. This BFCM weekend was a record breaker for Shopify. We saw nearly 11,000 orders created per minute and around 100,000 requests per second being served for extended periods during the weekend. All and all, most system metrics followed a pattern of 1.8 times what they were in 2017.
The somewhat unsurprising conclusion is that running towards the risk by injecting faults, load testing, and role-playing possible disaster scenarios pays off. Also, reliability goes beyond your “own” system most complex platforms these days have to deal with third parties to provide the best service possible. We have learned to trust our partners but also understand that any system can have downtime and in the end, Shopify is responsible to our merchants and buyers.