Shop is Shopify’s flagship shopping app. It lets anyone track their packages, find new products, and even plant trees to offset the carbon emissions from their purchases. Since launching in 2019, we’ve quickly grown to serve tens of millions of users. One of the most heavily used features of the app is the home page. It’s the first thing users see when they open Shop. The home page keeps track of people’s orders from the time they click the checkout button to when it’s delivered to their door. This feature brings so much value to our users, but we’ve had some technical challenges scaling this globally. As a result of Shop’s growth, the home feed was taking up a significant amount of our total database load, and was starting to have a user-facing impact.
We prototyped a few solutions to fix this load issue and ended up building a custom write-through cache for the home feed. This was a huge success—after about six weeks of engineering work, we built a custom caching solution that reduced database load by 15% and overall app latency by about 20%.
Identifying The Problem
The main screen of the Shop app is the most used feature. Serving the feed is complex as it requires aggregating orders from millions of Shopify and non-Shopify merchants in addition to handling tracking data from dozens of carriers. Due to a variety of factors, loading and merging this data is both computationally expensive and quite slow. Before we started this project, 30% of Shop’s database load was from the home feed. This load didn’t only affect the home feed, it affected performance on all aspects of the application.
We looked around for simple, straightforward solutions to this problem like introducing IdentityCache, updating our database schema, and adding more database indexes. After some investigation, we learned that we had little database-level optimization left to do and no time to embark on a huge code rewrite. Caching, on the other hand, seemed ideal for this situation. Because users use the home feed every day and the temporal based sort of the home feed, home feed data was usually only read after it was recently written, making it ideal for a cache of some sort.
Finding a Solution
Because of the structure of the home feed, we couldn't use a plug-and-play caching solution. We think of a given user’s home feed as a sorted list of a user’s purchases, where the list can be large (some people do a lot of shopping!). The list can be updated by a series of concurrent operations that include:
- adding a new order to display on the home feed (for example, when someone makes a purchase from a Shopify store)
- updating the details associated with an order (for example, when the order is delivered)
- removing an order from the list (for example, when a user manually archives the order).
In order to cache the home feed, we’d need a system that maintains a cached version of a user’s feed, while handling arbitrary updates to the orders in the feed and also maintaining the guarantee that the feed order is correct.
Another geeky data point, @shop processed ~1 billion order emails in 24 hours yesterday. That’s 7x more than last year. 99% of them were processed in 40 milliseconds or less. @siavashg and team are the best! pic.twitter.com/HbV65OIOM1
— Jean-Michel Lemieux (@jmwind) November 28, 2020
Due to the quantity of updates that we process, it’s infeasible to use a read-through cache that’s invalidated after every write, as the cache would end up being invalidated so often it would be practically useless. After some research, we didn’t find an existing solution that:
- wasn’t invalidated after writes
- could handle failure cases without showing stale data to users.
So, we built one ourselves.
Building Shop’s Caching Solution
Rather than querying the database every time a user requests the home feed, we now cache a copy of their home feed in a fast, distributed, horizontally-scaled caching system (we chose Memcached). Then we serve from the cache rather than the database at request time provided certain conditions are met. To keep the cache valid and correct before each database update, we mark the cache as “invalid” to ensure the cached data isn’t used while the cache and database are out of sync. After the write is complete, we update the cache with the new data and mark it as “valid” again.
Deciding on Memcached
At Shopify, we use two different technologies for caching: Memcached and Redis. Redis is more powerful than Memcached, supporting more complex operations and storing more complex objects. Memcached is simpler, has less overhead, and is more widely used for caching inside Shop. While we use Redis for managing queues and some caches, we didn’t need Redis’ complexity, so we chose a distributed Memcached.
The primary issue we had to solve was ensuring the cache never contained stale records. We minimize the chance of cache invalidation by building the cache using a write-through invalidation policy that invalidates the cache before a database write and revalidates it after the successful write. That led to the next hard question: how do we actually store the data in Memcached and handle concurrent updates?
The naive approach would be to store a single key for each user in Memcached that maps a user to their home feed. Then, on write, invalidate the cache by evicting the key from the cache, make the database update, and finally revalidate the cache by writing the key again. The issue, unfortunately, is that there’s no support for concurrent writes. At Shop’s scale, multiple worker machines often concurrently process order updates for the same user. Using a delete-then-write strategy introduces race conditions that could lead to an incorrect cache, which is unacceptable.
To support concurrent writes, we store an additional key/value pair (pending writes key) that tracks the validity of the cache for each user. The key stores the number of active writes to a given user’s home feed. Each time a worker machine is about to write to the database, we increment this value. When the update is complete, we decrement the value. This means the cache is valid when the pending writes key is zero.
However, there’s one final case. What happens if a machine makes a database update and fails to decrement the pending writes key due to an interrupt or exception? How can we know if the pending writes key is greater than zero because there's currently a database write in progress or a process was interrupted?
The solution is introducing a key with a short expiry that’s written before any database update. If this key exists, then we know there’s the possibility of a database update, but if it doesn’t and the pending writes key is greater than zero, we know there’s no active database write occurring, so it’s safe to rewarm the cache again.
Another interesting detail is that we needed this code to work with all of our existing code in Shop and interplay seamlessly with that code. We wrote a series of Active Record Concerns that we mixed into the relevant database records. Using Active Record concerns meant that the ORM’s API stayed exactly the same, causing this change to be totally transparent to developers, and ensuring that all of this code was forward compatible. When Shop Pay became available to anyone selling on Google or Facebook, we were able to integrate the caching with minimal overhead.
The Rollout Strategy
Another important piece of this project was the rollout. Once we’d built the caching logic and integrated it with the ORM, we had to ship the cache to users. Theoretically sound, unit-tested code is a good first step, but without real world data, we weren’t confident enough in our system to deploy this cache without strict testing. We wanted to validate our hypothesis that it would never serve stale data to users.
So, over the course of the few weeks, we ran an experiment. First, we turned on all the cache writing and updating logic (but not the logic to serve data from the cache) and tested at scale. Once we knew that system was durable and scalable, we tested its correctness. At home feed serve time, our backend loaded from both the cache and the database and compared their data, and would log to a dashboard if there was a discrepancy. After letting this experiment run for a few weeks and fixing the issues that arose, we were able to be confident in our system’s correctness and scalability. We knew that the cache was always going to be valid and would not serve users stale or incorrect data.
After rolling this cache out globally, we saw immediate, impactful results. Our database servers have a lighter load, In addition to the lower database load and faster home feed performance, we also observed a double digit decrease in overall CPU usage and a 20% decrease in our overall GraphQL latency. Our database servers have a lighter load, our users have a faster experience, and our developers don’t need to worry about high database load. It’s a win-win-win.
Ryan Ehrlich is a software engineer living in Palo Alto, California. He focuses on solving problems in large scale, distributed systems, and CV/NLP AI research. Outside of work, he’s an avid rock climber, cyclist, and hiker.
Wherever you are, your next journey starts here! If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Intrigued? Visit our Engineering career page to find out about our open positions and learn about Digital by Default.