By Berkay Antmen, Chris Wu, and Dave Sugden
In 2017, various teams at Shopify came together to build an external-facing live-streamed visualization of all the sales made by Shopify merchants during the Black Friday and Cyber Monday (BFCM) weekend. We call it the Shopify BFCM live map.
Shopify’s BFCM live map is a visual signal of the shift in consumer spending towards independent businesses and our way to celebrate the power of entrepreneurship. Over the years, it’s become a tradition for different teams within Shopify to iterate on the live map to see how we can better tell this story. Because of our efforts, people all over the world can watch our merchant sales in real-time, online, broadcast on television, and even in Times Square.
This year, the Shopify Data Platform Engineering team played a significant role in the latest iteration of the BFCM live map. Firstly, we sought to explore what new insights we could introduce and display on the live map. Secondly, and most importantly, we needed to figure out a way to scale the live map. Last year we had more than 1 million merchants. That number has grown to over 1.7 million. With just weeks left until BFCM, we were tasked with not only figuring out how to address the system’s scalability issues but also challenging ourselves to do so in a way that would help us create patterns we could repeat elsewhere in Shopify.
We’ll dive into how our team, along with many others, revamped the data infrastructure powering our BFCM live map using Apache Flink. In a matter of weeks, we created a solution that displayed richer insights and processed a higher volume of data at a higher uptime—all with no manual interventions.
Last Year’s Model Had Met Its Limit
Last year’s live map drew a variety of transaction data and metadata types from our merchants. The live map looked amazing and did the job, but now with more than 1.7 million merchants on our platform, we weren’t confident that the backend architecture supporting it would be able to handle the volume predicted for 2021.
With just weeks until BFCM, Shopify execs challenged us to “see if we know our systems” by adding new metrics and scaling the live map.
In this ask, the Shopify Data Platform Engineering team saw an opportunity. We have an internal consulting team that arose organically to assist Shopify teams in leveraging our data stack. Lately, they'd been helping teams adopt stateful stream processing technologies. Streaming is still a developing practice at Shopify, but we knew we could tap this team to help us use this technology to scale the BFCM live map. With this in mind, we met with the Marketing, Revenue, UX, Product, and Engineering teams, all of whom were equally invested in this project, to discuss what we could accomplish in advance of BFCM.
Deconstructing Last Year’s Model
We started by taking stock of the system powering the 2020 live map. The frontend was built with React and a custom 3D visualization library. The backend was a home-grown, bespoke stateful streaming service we call Cricket, built in Go. Cricket processes messages from relevant Kafka topics and broadcasts metrics to the frontend via Redis.
Our biggest concern was that this year Cricket could be overloaded by the volume coming from the checkout Kafka topic. To give you an idea of what that volume looked like, at peak we saw roughly 50,000 messages per second during the 2021 BFCM weekend. On top of the volume concerns, the checkout topic contained more than just the subset of events we needed, and those events carried fields we didn't intend to use.
Another challenge we faced was that the connection between Cricket and the frontend had a number of weaknesses. The original authors were aware of these, but they were trade-offs made to get things ready in time. We were using Redis to queue up messages and broadcast our metrics to browsers, which was inefficient and relatively complex. The metrics displayed on our live map have more relaxed ordering requirements than, say, a chat application, where message order matters. Specifically, our live map metrics:
- Can tolerate some data loss: On last year’s live map, arc visuals represent where an order is placed and where it’s shipping to. These visualizations are already sampled because we can’t display every single order in the browser (it would be far too many!). So it’s okay if we lose some of the arc visuals, since we can’t draw them all on screen anyway.
- Only require the latest value: While Cricket relays near real-time updates, we’re only interested in displaying the latest statistics for each metric. Last year those metrics included sales per minute, orders per minute, and our carbon offset. Queuing up and publishing the entire broadcast history for these metrics would be excessive (see the sketch after this list).
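To illustrate that second property, here’s a minimal sketch, assuming Redis as the store and a hypothetical live_map:sales_per_minute key: because only the latest value matters, a publisher can simply overwrite a single key and a reader can poll it, with no queue of history to drain. This is purely illustrative, not how Cricket publishes.

```scala
import redis.clients.jedis.Jedis

object LatestValueSketch {
  def main(args: Array[String]): Unit = {
    val redis = new Jedis("localhost", 6379) // placeholder connection details

    // Publisher side: overwrite one key per metric, so readers never see stale history.
    redis.set("live_map:sales_per_minute", "1234567.89")

    // Reader side: a browser poll only ever needs the current value.
    println(s"latest sales per minute: ${redis.get("live_map:sales_per_minute")}")

    redis.close()
  }
}
```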
This year, on top of the metrics listed above, we sought to add in:
- Product trends: Calculated as the top 500 categories of products with the greatest change in sales volume over the last six hours.
- Unique shoppers: Calculated as unique buyers per shop, aggregated over time (see the sketch after this list).
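As a rough illustration of how a unique shoppers metric could be computed with stateful stream processing, here’s a minimal sketch in Flink’s Scala DataStream API. The Checkout event shape, its field names, and the choice to hold a set of buyer IDs in keyed state are assumptions for the sketch, not the production job.

```scala
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

// Illustrative event shape.
case class Checkout(shopId: Long, buyerId: Long)

// Tracks distinct buyers per shop and emits (shopId, count) whenever a new buyer appears.
class UniqueShoppersPerShop extends KeyedProcessFunction[Long, Checkout, (Long, Int)] {
  @transient private var buyers: ValueState[Set[Long]] = _

  override def open(parameters: Configuration): Unit = {
    // Keyed state is scoped to the shop; a MapState would scale better for very large shops.
    buyers = getRuntimeContext.getState(
      new ValueStateDescriptor[Set[Long]]("buyers", createTypeInformation[Set[Long]]))
  }

  override def processElement(
      checkout: Checkout,
      ctx: KeyedProcessFunction[Long, Checkout, (Long, Int)]#Context,
      out: Collector[(Long, Int)]): Unit = {
    val seen = Option(buyers.value()).getOrElse(Set.empty[Long])
    if (!seen.contains(checkout.buyerId)) {
      buyers.update(seen + checkout.buyerId)
      out.collect((checkout.shopId, seen.size + 1)) // unique shoppers so far for this shop
    }
  }
}

// Usage sketch: checkouts.keyBy(_.shopId).process(new UniqueShoppersPerShop)
```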
In our load tests, we observed that Redis would quickly become a bottleneck as the number of published messages and subscriber connections increased. Browser long polling could then hang for too long, causing the live map arc visuals to momentarily disappear until a response arrived. We needed to address this because we forecasted even more data to process this year. After talking to the teams who built last year’s model and evaluating what existed, we developed a plan and started building our solution.
The 2021 Solution
At a minimum, we knew that we had to deliver a live map that scaled at least as well as last year’s, so we were hesitant to go about changing too much without time to rigorously test it all. In a way this complicated things because while we might have preferred to build from scratch, we had to iterate upon the existing system.
In our load tests, with 1 million checkout events per second at peak, the Flink pipeline was able to operate well under high volume. We decided to put Flink on the critical path to filter out irrelevant checkout events and resolve the biggest issue—that of Cricket failing to scale. By doing this, Cricket was able to process one percent of the event volume to compute the existing metrics, while relying on Flink for the rest.
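To make the filtering step concrete, here’s a minimal sketch of what such a job could look like with Flink’s Kafka connector, assuming JSON-encoded events, illustrative topic names (checkouts in, checkouts_filtered out for Cricket to consume), and placeholder filter and projection logic:

```scala
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer}

object CheckoutFilterJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment

    val props = new Properties()
    props.setProperty("bootstrap.servers", "kafka:9092") // placeholder broker address
    props.setProperty("group.id", "bfcm-live-map-filter")

    // Consume the raw checkout topic.
    val rawCheckouts: DataStream[String] =
      env.addSource(new FlinkKafkaConsumer[String]("checkouts", new SimpleStringSchema(), props))

    // Keep only the events the live map needs, and only the fields it reads.
    val relevant: DataStream[String] = rawCheckouts
      .filter(isRelevantCheckout _)
      .map(projectLiveMapFields _)

    // Hand the much smaller stream back to Cricket via a second topic.
    relevant.addSink(new FlinkKafkaProducer[String]("checkouts_filtered", new SimpleStringSchema(), props))

    env.execute("bfcm-checkout-filter")
  }

  // Placeholder predicate: the real relevance criteria aren't shown here.
  private def isRelevantCheckout(json: String): Boolean = json.contains("\"completed_at\"")

  // Placeholder projection: a real job would drop the fields the live map never reads.
  private def projectLiveMapFields(json: String): String = json
}
```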
Due to our high availability requirements for the Flink jobs, we used a mixture of cross-region sharding and cross-region active-active deployment, with deduplication handled in Cricket. For the existing metrics, Cricket remained the source of computation; for the new metrics computed by Flink, Cricket acted as a relay layer.
For our new product trends metric, we leveraged our product categorization algorithm. Every five minutes, we emitted the 500 product categories with the largest changes in sales quantity. For a given product, the sales quantity percentage change was computed with the following formula:
change = SUM(prior 1hr sales quantity) / MEAN(prior 6hr sales quantity) - 1
For example, a product that sold 150 units in the past hour against a mean of 100 units over the prior six-hour window would score 150 / 100 - 1 = 0.5, a 50 percent increase.
At a high level, the product trends job aggregates sales quantities per category over trailing one-hour and six-hour windows, computes the change, and emits the categories with the largest change every five minutes.
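Here’s a minimal sketch of how that windowed computation could be expressed in Flink’s Scala API. The event shape, the reading of MEAN(prior 6hr) as a per-hour average over the window, and the downstream top-500 ranking are assumptions for illustration, not the production job.

```scala
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

// Illustrative event shape: category comes from the product categorization algorithm.
case class CategorizedSale(category: String, quantity: Long, timestampMs: Long)

// Emits (category, change) once per five-minute slide over a six-hour window.
class TrendScore extends ProcessWindowFunction[CategorizedSale, (String, Double), String, TimeWindow] {
  override def process(category: String,
                       context: Context,
                       sales: Iterable[CategorizedSale],
                       out: Collector[(String, Double)]): Unit = {
    val windowEnd = context.window.getEnd
    val oneHourAgo = windowEnd - Time.hours(1).toMilliseconds

    val lastHourSum = sales.filter(_.timestampMs >= oneHourAgo).map(_.quantity).sum.toDouble
    val hourlyMean = sales.map(_.quantity).sum.toDouble / 6.0 // assumes MEAN(6hr) is a per-hour average

    if (hourlyMean > 0) {
      out.collect((category, lastHourSum / hourlyMean - 1)) // change = SUM(1hr) / MEAN(6hr) - 1
    }
  }
}

// Usage sketch (categorizedSales: DataStream[CategorizedSale] with event-time timestamps assigned):
// categorizedSales
//   .keyBy(_.category)
//   .window(SlidingEventTimeWindows.of(Time.hours(6), Time.minutes(5)))
//   .process(new TrendScore)
//   // a small downstream operator would keep only the 500 categories with the largest change
```

A six-hour window sliding every five minutes lines up with the five-minute emission cadence described above.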
So How Did It Do?
Pulling computation out of Cricket into Flink proved to be the right move. Those jobs ran with 100 percent uptime throughout BFCM without backpressure and required no manual intervention. To mitigate risk, we also implemented the new metrics as batch jobs on our time-tested Spark infrastructure. While these jobs ran well, we ended up not relying on them because Flink met our expectations.
Here’s a look at what we shipped:
In the end, user feedback was positive, and we processed significantly more checkout events, as well as produced new metrics.
However, not everything went as smoothly as planned. The method that we used to fetch messages from Redis and serve them to the end users caused high CPU loads on our machines. This scalability issue was compounded by Cricket producing metrics at a faster rate and our new product trends metric clogging Redis with its large memory footprint.
A small sample of users noticed a visual error: some of the arc visuals would initiate, then blip out of existence. With the help of our Production Engineering team, we dropped some of the unnecessary Redis state and quickly unclogged it within two hours.
Despite the hiccup, the negative user impact was minimal. Flink met our high expectations, and we took notes on how to improve the live map infrastructure for the next year.
Planning For Next Year
Another successful BFCM is behind us. The internal library we built for Flink enabled our teams to assemble sophisticated pipelines for the live map in a matter of weeks, proving that we can run mission-critical applications on this technology.
Beyond BFCM, what we’ve built can be used to improve other Shopify analytic visualizations and products. These products are currently powered by batch processing and the data isn’t always as fresh as we’d like. We can’t wait to use streaming technology to power more products that help our merchants stay data-informed.
As for the next BFCM, we’re planning to simplify the system powering the live map. And, because we had such a great experience with it, we’re looking to use Flink to handle all of the complexity.
This new system will enable us to:
- no longer have to maintain our own stateful stream processor
- remove the bottleneck in our system
- only have to consider backpressure at a single point (versus having to handle backpressure in our streaming jobs, in Cricket, and between Cricket and the Web tier)
We are exploring a few different solutions, but a promising one works as follows: all of the metrics would be produced by Flink jobs and periodically snapshotted to a database or key-value store, and the Web tier would periodically synchronize its in-memory cache from that store and serve the polling requests from browsers. This design is relatively simple and satisfies our scalability requirements while reducing complexity.
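As a rough illustration of the Web-tier piece of this design, here’s a minimal sketch that assumes Redis as the key-value store and hypothetical metric key names; the Flink jobs would write the snapshots, and the polling endpoints would read from this in-memory cache rather than hitting the store on every request.

```scala
import java.util.concurrent.{ConcurrentHashMap, Executors, TimeUnit}

import redis.clients.jedis.Jedis

// Refreshes an in-memory copy of the metric snapshots on a timer and serves reads from memory.
class MetricCache(redisHost: String, keys: Seq[String], refreshEverySeconds: Long) {
  private val cache = new ConcurrentHashMap[String, String]()
  private val scheduler = Executors.newSingleThreadScheduledExecutor()

  def start(): Unit =
    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = refresh()
    }, 0, refreshEverySeconds, TimeUnit.SECONDS)

  // Browser polls are served from memory; no per-request round trip to the store.
  def latest(metric: String): Option[String] = Option(cache.get(metric))

  private def refresh(): Unit = {
    val redis = new Jedis(redisHost) // placeholder connection details
    try keys.foreach(k => Option(redis.get(k)).foreach(v => cache.put(k, v)))
    finally redis.close()
  }

  def stop(): Unit = scheduler.shutdown()
}

// Usage sketch: new MetricCache("redis-host", Seq("live_map:sales_per_minute"), refreshEverySeconds = 5).start()
```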
Overall, we’re pleased with what we accomplished and excited that we have such a head start on next year’s design. Our platform handled record-breaking sales over BFCM and commerce isn't slowing down. Want to help us scale and make commerce better for everyone? Join our team.
Berkay Antmen leads the Streaming Capabilities team under Data Platform Engineering. He’s interested in computational mathematics and distributed systems. His current Shopify mission is to make large-scale near real-time processing easy. Follow Berkay on Twitter.
Chris Wu is a Product Lead who works on the Data Platform team. He focuses on making great tools to work with data. In his spare time he can be found buying really nice notebooks but never actually writing in them.
Dave Sugden is a Staff Data Developer who works on the Customer Success team, enabling Shopifolk to onboard to streaming technology.
Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.