In distributed systems, there are plenty of occasions for things to go wrong. That’s why resiliency and redundancy are important. But no matter what systems you put in place, and whether or not you touched your deployments, issues can arise. That makes it critical to acknowledge the near misses: the situations where something could have gone wrong, and the situations where something did, but it could have been worse. When was the last time that happened to you? For us at Shopify, it was on September 30th, 2021, when the expiration of Let’s Encrypt’s (old) root certificate almost led to a global outage of our platform.
In April 2021, Let’s Encrypt announced that its former root certificate was expiring. We have used Let’s Encrypt as our public certificate provider since becoming a sponsor in 2016, so we made sure Shopify’s edge infrastructure was up to date with the different requirements and we wouldn’t stop serving traffic to all of (y)our beloved shops. As always, Let’s Encrypt did their due diligence with communications and by providing a cross-signing of their new root certificate by the old one. This means that while clients didn’t yet trust the new root certificate, they trusted the old one, and because the new root certificate was signed by the old one, they would transfer their trust to the new one. The period between the announcement and the expiration was also long enough for any Let’s Encrypt-issued certificate, which expires after three months, to be reissued under the new, cross-signed chain and be considered valid through either the old or the new root certificate. We didn’t expect anything bad to happen on September 30th, 2021, when the old root certificate was set to expire at 10:00 a.m. Eastern Standard Time.
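Concretely, Let’s Encrypt’s old root is DST Root CA X3 and its new root is ISRG Root X1, with leaf certificates issued by the R3 intermediate (these names come from Let’s Encrypt’s public documentation rather than from our incident notes). The cross-signed chain looks roughly like this:

```
your site's leaf certificate
  └── issued by R3 (Let's Encrypt intermediate)
        └── issued by ISRG Root X1 (new root)
              └── cross-signed by DST Root CA X3 (old root, expired September 30, 2021)
```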
At 10:15 a.m. that same day, our monitors started complaining about certificate errors, not at Shopify’s edge (that is, between the public internet and Shopify), but between our services. As a member of Shopify’s traffic team, which handles a number of matters related to bringing traffic safely and reliably into Shopify’s platform (including the provisioning and handling of certificates), I joined the incident response to help figure out what was happening. Our first response was to lock the deployments of the Shopify monolith (using spy, our chatops) while some of us connected to containers to try to figure out what was happening in there. In the meantime, we started looking at the deployments that were in progress when the issue started manifesting. It didn’t make any sense: those changes had nothing to do with the way services interconnect, nor with certificates. This is when the Let’s Encrypt root certificate expiry started clicking in our minds. An incident showing certificate validity errors right after the expiry date couldn’t be a coincidence, but we couldn’t reproduce the error in our browsers, or even using curl. Using openssl, we could, however, observe the certificate expiry for the old root certificate:
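The original post showed the openssl output at this point; a roughly equivalent check looks like the following sketch, where the hostname and CA file path are placeholders rather than the actual endpoints we tested:

```sh
# Illustrative only, not the exact command from the incident.
# Against a trust store that only contains the old, now-expired root,
# the handshake verification fails with an error along the lines of
# "certificate has expired" (OpenSSL verify error 10).
echo | openssl s_client \
  -connect shop.example.com:443 \
  -servername shop.example.com \
  -CAfile /path/to/stale/cacert.pem
```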
The error was related to the client being used for those connections, and we saw those errors appearing in multiple services across Shopify using different configurations and libraries. For a number of those services, the errors were bubbling up from the internally built library that allows services to check people’s authentication to Shopify. While Faraday is the library we generally use for HTTP connections, our internal library has dependencies on rack-oauth2 and openid_connect. Looking at the dependency chains of both gems, we saw the following:
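The post showed the resolved dependency trees here; schematically, and with the intermediate gems recalled from the public gemspecs rather than copied from that output, the chains look like this:

```
rack-oauth2 ──────────────────────────► httpclient
openid_connect ──► rack-oauth2 ───────► httpclient
openid_connect ──► swd / webfinger ───► httpclient
```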
Both rack-oauth2 (directly) and openid_connect (indirectly) depend on httpclient, which, according to the library’s GitHub repository, “gives something like the functionality of libwww-perl (LWP) in Ruby.”
From other service errors, we identified that google-api-client was also failing. Using the same process, we pinpointed the same library as a dependency:
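Again schematically, based on the gem’s published dependency list rather than the original output:

```
google-api-client ──► httpclient
```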
And so we took a closer look at httpclient and...

Code snippet from nahi/httpclient
Uh-oh, that doesn’t look good. httpclient is heavily used, whether directly or through indirect exposure via the dependency chain. Like web browsers, httpclient embeds its own copy of the root certificate store. The main difference is that the store bundled in the library was six years old (!!), while reference root certificate stores are generally updated every few months. So even with Let’s Encrypt’s due diligence, a stale client store that didn’t trust the new root certificate directly, and no longer trusted the old one since it had expired, was enough to cause internal issues.
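A quick way to see the problem for yourself is to inspect the CA bundle shipped inside the installed gem; this is an illustrative sketch, and the bundle’s path inside the gem may vary between versions:

```ruby
# Illustrative check: find the CA bundle bundled with the httpclient gem and
# report how recently any certificate in it was issued.
require "httpclient"
require "openssl"

gem_path = Gem.loaded_specs["httpclient"].full_gem_path
cacert   = File.join(gem_path, "lib", "httpclient", "cacert.pem") # path may differ per version

pem_blocks = File.read(cacert).scan(/-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----/m)
certs      = pem_blocks.map { |pem| OpenSSL::X509::Certificate.new(pem) }

puts "#{certs.size} certificates, newest issued on #{certs.map(&:not_before).max}"
```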
Our emergency fix was simple: we forked the Git repository, created a branch that overrode cacert.pem with the most recent root certificate bundle, and started using that branch in our deployments to make things work. After confirming the fix behaved as expected and deploying it to our canaries, the problem was solved for the monolith. Automation then helped create pull requests for all our affected repositories.
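In a Ruby application, that kind of override is a one-line Gemfile change; the repository URL and branch name below are placeholders, not our actual fork:

```ruby
# Gemfile: point Bundler at the patched fork instead of the released gem.
gem "httpclient", git: "https://github.com/example-org/httpclient.git",
                  branch: "update-cacert"
```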
Overriding cacert.pem with a more recent bundle is a temporary fix. However, following a solve-fast approach, it was the one we knew would work automatically for all our deployments without any other changes. To support this fix and make sure a similar issue doesn’t happen again soon, we put systems in place to keep track of changes in the root certificates and automatically update them in our fork when needed. A better long-term approach could be to use the system root certificate store instead, something we can move to after reviewing the system root certificate stores across all of our runtime environments.
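With httpclient, for example, the client can be told to use OpenSSL’s default (system) certificate paths instead of the bundled store; a minimal sketch, assuming the installed version exposes SSLConfig#set_default_paths:

```ruby
# Minimal sketch: trust the operating system's certificate store rather than
# the CA bundle shipped with the gem.
require "httpclient"

client = HTTPClient.new
client.ssl_config.set_default_paths # use OpenSSL's default cert file/directory

client.get("https://letsencrypt.org/") # now validated against the system store
```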
We wondered why it took about 15 minutes for us to start seeing the effects of the certificate expiry. The answer is actually in the trigger: we identified that we started seeing the issue on the Shopify monolith when a deployment happened. HTTP supports persistent connections, also called HTTP keep-alive, which keep a connection open as long as it’s being used and only close it after it has been idle for a short period. Moreover, TLS validation (the check of the certificates’ validity) is only performed while initializing the connection, and the trust is maintained for the duration of that connection. Given the traffic on Shopify, our applications kept their connections to other systems alive, and the only reason those connections were broken was Kubernetes pods being recreated to deploy a new version, leading to new HTTP connections and the failure of TLS validation, hence the 15-minute discrepancy.
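A sketch of that behavior, assuming httpclient and a placeholder endpoint: the certificate chain is validated during the TLS handshake of the first request, and subsequent requests reuse the already-established, already-trusted connection.

```ruby
# Illustrative only: with keep-alive, certificate validation happens once per
# connection, not once per request.
require "httpclient"

client = HTTPClient.new
client.get("https://shop.example.com/health") # TLS handshake: chain validated here
client.get("https://shop.example.com/health") # connection reused: no re-validation
# Only when the connection is torn down (idle timeout, pod recreation on deploy)
# does a new handshake, and thus a new validation, happen.
```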
Besides our Ruby applications having (indirect!) dependencies on httpclient, a few other of our systems were affected by the same problem. In particular, services powered by data were left hanging, as the application providing them with data was affected by the disruption. For instance, product recommendations weren’t shown during that time, marketing campaigns were temporarily throttled, and, more visibly to our merchants’ customers, order confirmations were delayed for a short period because risk analysis couldn’t be performed.
Of the Shopify monolith, however, only the canaries (that is, the servers to which we roll out changes first to test their effect in production before rolling them out to the rest of the fleet) were affected by the issue. Locking deployments, our initial incident response action, also stops any in-progress deployment in its current state. This simple action allowed us to avoid cycling the monolith’s Kubernetes pods and kept the current version running, protecting us from a global outage of Shopify and making September 30th, 2021 that one time an outage could have been way worse.
Raphaël is a Staff Production Engineer and the tech lead of the Traffic team at Shopify, taking care of the interfaces between Shopify and the outside world and providing reliable and scalable systems for configuring the edge of ever-growing applications. He holds a Ph.D. in Computer Engineering, specializing in systems performance analysis and tracing, and sometimes gives lectures to future engineers at Polytechnique Montréal about Distributed Systems and Cloud Computing.
Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.