Data scientists at Shopify expect fast results when querying large datasets across multiple data sources. We use Trino (a distributed SQL query engine) to provide quick access to our data lake and recently, we’ve invested in speeding up our query execution time.
On top of handling over 500 Gbps of data, we strive to deliver p95 query results in five seconds or less. To achieve this, we’re constantly tuning our infrastructure. But with each change comes a risk to our system. A disruptive change could stall the work of our data scientists and interrupt our engineers on call.
That’s why Shopify’s Data Reliability team built custom verification, benchmarking, and profiling tooling for testing and analyzing Trino. Our tooling is designed to minimize the risk of various changes at scale.
Below we’ll walk you through how we developed our tooling. We’ll share simple concepts to use in your own Trino deployment or any other complex system involving frequent iterations.
As Shopify grows, so does our data and our team of data scientists. To handle the increasing volume, we’ve scaled our Trino cluster to hundreds of nodes and tens of thousands of virtual CPUs.
Managing our cluster gives way to two main concerns:
- Optimizations: We typically have several experiments on the go for optimizing some aspect of our configuration and infrastructure.
- Software updates: We must keep up-to-date with new Trino features and security patches.
Both of these concerns involve changes that need constant vetting, but the changes are further complicated by the fact that we run a fork of Trino. Our fork allows us more control over feature development and release schedules. The tradeoff is that we’re sometimes the first large organization to test new code “in the field,” and if we’re contributing back upstream to Trino, the new code must be even more thoroughly tested.
Changes also have a high cost of failure because our data scientists are interrupted and our engineers must manually roll back Trino to a working version. Due to increasing uncertainty on how changes might negatively affect our cluster, we decided to hit the brakes on our unstructured vetting process and go back to the drawing board. We needed a tool that could give us confidence in potential changes and increase the reliability of our system as a whole.
Identifying the Missing Tool
To help identify our missing tool, we looked at our Trino deployment as a Formula 1 race car. We need to complete each lap (or query) as fast as possible and shorten the timeline between research and production of our engine, all the while considering safety.
The highest-ranked Formula 1 teams have their own custom simulators. They put cars in real-life situations and use data to answer critical questions like, “How will the new steering system handle on the Grand Prix asphalt?” Teams also use simulations to prevent accidents from happening in real life.
Taking inspiration from this practice, we iterated on a few simulation prototypes:
- Replaying past queries. First, we built a tool to pluck a previously run query from our logs and “replay” it in an isolated environment.
- Replicating real life. Next, we wrote a tool that replicated traffic from a previous work day. Think of it like travelling back in time in our infrastructure.
- Standardizing our simulations. We also explored an official benchmarking framework to create controlled simulations.
Another thing we were missing in our “garage” was a good set of single-purpose gauges for evaluating a possible change to Trino. We had some manual checks in our heads, but they were undocumented, and some checks caused significant toil to complete. These could surely be formalized and automated.
We were now left with a mixed bag of ideas and prototypes that lacked structure.
We decided to address all our testing concerns within a single framework to build the structure that was lacking in our solution. Initially, we had three use cases for this framework: verification, benchmarking, and profiling of Trino.
The framework materialized as a lightweight Python library. It eliminates toil by extracting undocumented tribal knowledge into code, with familiar, intuitive interfaces. An interface may differ depending on the use case (verification, benchmarking, or profiling), but all rely on the same core code library.
The core library is a set of classes for Trino query orchestration. It’s essentially an API that has a shared purpose for all of our testing needs. The higher level
Library class handles connections, query states, and multithreaded execution or cancellation of queries. The
Query class handles more low level concerns, such as query annotations, safety checks, and fetching individual results. Our library makes use of the open source repository trino-python-client which implements the Python Database API Specification for Trino.
Verification: Accelerating Deployment
Verification consists of simple checks to ensure Trino still works as expected after a change. We use verification to accelerate the deployment cycle for a change to Trino.
We knew the future users of our framework (Data Platform engineers) have a development background, and a high likelihood of knowing Python. Associating verification with unit testing, we decided to leverage an existing testing framework as our main developer interface. Conceptually, PyTest's features fit our verification needs quite well.
We wrote a PyTest interface on top of our query orchestration library that abstracted away all the underlying Trino complications into a set of test fixtures. Now, all our verification concerns are structured into a suite of unit tests, in which the fixtures initialize each test in a repeatable manner and handle the cleanup after each test is done. We put a strong focus on testing standards, code readability, and ease of use.
Here’s an example block of code for testing our Trino cluster:
First, the test is marked. We’ve established a series of marks, so a user can run all “correctness” tests at a time, all “performance” tests at a time, or every test except the “production_only” ones. In our case, “correctness” means we’re expecting an exact set of rows to be returned given our query. “Correctness” and “verification” have interchangeable meanings here.
Next, a fixture (in this case,
candidate_cluster, which is a Trino cluster with our change applied) creates the connection, executes a query, fetches results, and closes the connection. Now, our developers can focus on the logic of the actual test.
Lastly, we call
run_query and run a simple assertion. With this familiar pattern, we can already check off a handful of our items on our undocumented Trino upgrade checklist.
Now, we increase the complexity:
@pytest.mark.performance. Although performance testing is a broad concept, by asserting on a simple threshold comparison we can verify performance of a given factor (for example, execution time) isn’t negatively impacted.
In this test, we call
multi_cluster, which runs the same query on two separate Trino clusters. We look for any differences between the “candidate” cluster we’re evaluating and a “standby” control cluster at the same time.
We used the
multi_cluster pattern during a Trino upgrade when verifying our internal Trino User Defined Functions (which are utilized in domains such as finance and privacy). We also used this pattern when assessing a candidate cluster’s proposed storage caching layer. Although our suite didn’t yet have the ability to assert on performance automatically, our engineer evaluated the caching layer with some simple heuristics and intuition after kicking off some tests.
We plan to use containerization and scheduling to automate our use cases further. In this scenario, we’d run verification tests at regular intervals and make decisions based on the results.
So far, this tool covers the “gauges” in our race car analogy. We can take a pit stop, check all the readings, and analyze all the changes.
Benchmarking: Simplifying Infrastructure
Formula 1 cars need to be aerodynamic, and they must direct air to the back engine for cooling. Historically, Formula 1 cars are benchmarked in a wind tunnel, and every design is tested in the same wind tunnel with all components closely monitored.
We took some inspiration from the practice of Formula 1 benchmarking. Our core library runs TPC-DS queries on sample datasets that are highly relevant to the nature of our business. Fortunately, Trino generates these datasets deterministically and makes them easily accessible.
The benchmarking queries are parametrized to run on multiple scale factors (
sf). We repeat the same test on increasing magnitudes of dataset size with corresponding amounts of load on our system. For example,
sf10 represents a 10 GB database, while
sf1000 represents 1 TB.
PyTest is just one interface that can be swapped out for another. But in the benchmarking case, it continues to work well:
This style of benchmarking is an improvement over our team members’ improvised methodologies. Some used Spark or Jupyter Notebooks, while others manually executed queries with SQL consoles. This led to inconsistencies, which was against the point of benchmarking.
We’re not looking to build a Formula 1 wind tunnel. Although more advanced benchmarking frameworks do exist, their architectures are more time-consuming to set up. At the moment, we’re using benchmarking for a limited set of simple performance checks.
Profiling and Simulations: Stability at Scale
Profiling refers to the instrumentation of specific scenarios for Trino, in order to optimize how the situations are handled. Our library enables profiling at scale, which can be utilized to make a large system more stable.
In order to optimize our Trino configuration even further, we need to profile highly specific behaviours and functions. Luckily, our core library enables us to write some powerful instrumentation code.
Notably, it executes queries and ignores individual results (we refer to these as background queries). When kicking off hundreds of parallel queries, which could return millions of rows of data, we’d quickly run out of memory on our laptops (or external Trino clients). With background queries, we put our cluster into overdrive with ease and profile on a much larger scale than before.
We formalized all our prototyped simulation code and brought it into our library, illustrated with the following samples:
generate_traffic is called with a custom profile to target a specific behaviour.
replay_queries plays back queries in real time to see how a modified cluster handles them. These methodologies cover edge cases that a standard benchmark test will miss.
This sort of profiling was used when evaluating an auto-scaling configuration for cloud resources during peak and off-hours. Although our data scientists live around the world, most of our queries happen between 9-5 PM EST, so we’re overprovisioned outside of these hours. One of our engineers experimented with Kubernetes’ horizontal pod autoscaling, kicking off simulated traffic to see how our count of Trino workers adjusted to different load patterns (such as “high to low” and “low to high”).
The Results and What’s Next
Building a faster and safer Trino is a platform effort supported by multiple teams. With this tooling, the Data Foundations team wrote an extensive series of correctness tests to prepare for the next Trino upgrade. The tests helped iron out issues and led to a successful upgrade! To bring back our analogy, we made an improvement to our race car, and when it left the garage, it didn’t crash. Plus, it maintained its speed—our P95 query execution time has remained stable over the past few months (the upgrade occurred during this window).
By using this tool, our team learned about the effectiveness of our experimental performance changes, such as storage caching or traffic-based autoscaling. We were able to make more informed decisions about what (or what not) to ship.
Another thing we learned along the way is that performance testing is complicated. Here are a few things to consider when creating this type of tooling:
- A solid statistics foundation is crucial. This helps ensure everyone is on the same page when sharing numbers, interpreting reports, or calculating service level indicators.
- Many nuances of an environment can unintentionally influence results. To avoid this, understand the system and usage patterns on a deep level, and identify all differences in environments (for example, “prod” vs. “dev”).
- Ensure you gather all the relevant data. Some of the data points we care about, such as resource usage, are late-arriving, which complicates things. Automation, containerization, and scheduling are useful for this sort of problem.
In the end, we scoped out most of our performance testing and profiling goals from this project, and focused specifically on verification. By ensuring that our framework is extensible and that our library is modular, we left an opportunity for more advanced performance, benchmarking, and profiling interfaces to be built into our suite in the future.
We’re excited to use this tool in several gameday scenarios to prepare our Data Reliability team for an upcoming high-traffic weekend—Black Friday and Cyber Monday—where business-critical metrics are needed at a moment’s notice. This is as good a reason as ever for us to formalize some repeatable load tests and stress tests, similar to how we test the Shopify system as a whole.
We’re currently evaluating how we can open-source this suite of tools and give back to the community. Stay tuned!
Interested in tackling challenging problems that make a difference? Visit our Data Science & Engineering career page to browse our open positions.
Noam is a Hacker in both name and spirit. For the past three years, he’s worked as a data developer focused on infrastructure and platform reliability projects. From Toronto, Canada, he enjoys biking and analog photography (sometimes at the same time).