By Michelle Ark and Chris Wu
In January of 2014, Shopify built a data pipeline platform for the data science team called Starscream. Back then, we were a smaller team and needed a tool that could deal with everything from ad hoc explorations to machine learning models. We chose to build with PySpark to get the power of a generalized distributed computer platform, the backing of the industry standard, and the ability to tap into the Python talent market.
Fast forward six years and our data needs have changed. Starscream now runs 76,000 jobs and writes 300 terabytes a day. As we grew, some types of work went away, but others (like simple reports) became so commonplace we do them every day. While our Python tool based on PySpark was computationally powerful, it wasn’t optimized for these commonplace tasks. If a product manager needed a simple rollup for a new feature by country, pulling it, and modeling it wasn’t a fast task.
Below, we’ll dive into how we created a robust, production-ready SQL modeling workflow for building straightforward pipelines by leveraging dbt. We'll walkthrough how we built this system—which we call Seamster—and created tooling for testing and documentation on top of it.
The Problem
When we interviewed our users to understand their workflow on Starscream, there were two issues we discovered: development time and thinking.
Development time encompasses the time data scientists use to prototype the data model they’d like to build, run it, see the outcome, and iterate. The PySpark platform isn’t ideal for running straightforward reporting tasks, often forcing data scientists to write boilerplate code and yielding long runtimes. This leads to long iteration cycles when trying to build models on unfamiliar data.
The second issue, thinking, is more subtle and deals with the way the programming language forces you to look at the data. Many of our data scientists prefer SQL to python because its structure forces consistency in business metrics. When interviewing users, we found a majority would write out a query in SQL then translate it to Python when prototyping. Unfortunately, query translation is time consuming and doesn’t add value to the pipeline.
To understand how widespread these problems were, we audited the jobs run and surveyed our data science team for the use cases. We found that 70 percent or so of the PySpark jobs on Starscream were full batch queries that didn’t require generalized computing. We viewed this as an opportunity to make a kickass optimization for a painful workflow.
Enter Seamster
Our goal was to create a SQL pipeline for reporting that enables data scientists to create simple reporting faster, while still being production ready. After exploring a few alternatives, we felt that the dbt library came closest to our needs. Their tagline “deploy analytics code faster with software engineering practices” was exactly what we were looking for in a workflow. We opted to pair it with Google BigQuery as our data store and dubbed the system and its tools, Seamster.
We knew that any off-the-shelf system wouldn’t be one-size-fits-all. In moving to dbt, we had to implement our own:
- Source and model structure to modularize data model development
- Unit testing to increase the types of testable errors
- Continuous integration (CI) pipelines to provide safety and consistency guarantees
Source Independence and Safety
With dozens of data scientists making data models in a shared repository, a great user experience would:
- Maximize focus on work
- Minimize the impact of model changes by other data scientists
By default, dbt declares raw sources in a central sources.yml
. This quickly became a very large file as it included the schema for each source, in addition to the source name. This caused a huge bottleneck for teams editing the same file across multiple PRs.
To mitigate the bottleneck, we leveraged the flexibility of dbt and created a top-level ‘sources’ directory to represent each raw source with its own source-formatted YAML file. This way, data scientists can parse only the source documentation that’s relevant for them and contribute to the sources.yml
file without stepping on each other’s toes.
Base models are one-to-one interfaces to raw sources.
We also created a base layer of models using the ‘staging’ concept from dbt to implement their best practice of limiting references to raw data. Our base models serve as a one-to-one interface to raw sources. They don’t change the grain of the raw source, but do apply renaming, recasting, or any other cleaning operation that relates to the source data collection system.
The base layer serves to protect users from breaking changes in raw sources. Raw external sources are by definition out of the control of Seamster and can introduce breaking changes for any number of reasons, at any point in time. If and when this happens, users only need to apply the fix to the base model representing the raw source, as opposed to every individual downstream model that depends on the raw source.
Model Ownership for Teams
We knew that the tooling improvements of Seamster would only be one part of a greater data platform at Shopify. We wanted to make sure we’re providing mechanisms to support good dimensional modelling practices and support data discovery.
In dbt, a model is simply a .sql file. We’ve extended this definition in Seamster to define a model as a directory consisting of four files:
model_name.sql
schema.yml
README.md
test_model_name.py
Users can further organize models into directories that indicate a data science team at Shopify like Finance or Marketing.
To support a clean data warehouse, we also organized data models into layers that differentiate between:
- Base: data models that are one-to-one with raw data, but cleaned, recast, and renamed
- Application-ready: data that isn’t dimensionally modelled but still transformed and clean for consumption by another tool (for example, training data for a machine learning algorithm)
- Presentation: shareable and reliable data models that follow dimensional modelling best practices and can be used by data scientists across different domains
With these two changes, a data consumer can quickly understand the data quality they can expect from a model and find the owner in case there is an issue. We also pass this metadata upstream to other tools to help with the data discovery workflow.
More Tests
dbt has native support for ‘schema tests’, which are encoded in a model’s schema.yml file. These tests run against production data to validate data invariants, such as the presence of null values or the uniqueness of a particular key. This feature in dbt serves its purpose well, but we also want to enable data scientists to write unit tests for models that run against fixed input data (as opposed to production data).
Testing on fixed inputs allows the user to test edge cases that may not be in production yet. In larger organizations, there can and will be frequent updates and many collaborators for a single model. Unit tests give users confidence that the changes they’re making won’t break existing behaviour or introduce regressions.
Seamster provides a Python-based unit testing framework. Users can write their unit tests in the test_model_name.py
file in the model directory. The framework enables constructing ‘mock’ input models from fixed data. The central object in this framework is a ‘mock’ data model, which has an underlying representation of a Pandas dataframe. Users can pass fixed data to the mock constructor as either a csv-style string, Pandas dataframe, or a list of dictionaries to specify input data.
Input and expected MockModels are built from static data. The actual MockModel is built from input MockModels by BigQuery. Actual and expected MockModels can assert equality or any Great Expectations expectation.
A constructor creates a test query where a common table expression (CTE) represents each input mock data model, and any references to production models (identified using dbt’s ‘ref’ macro) are replaced by references to the corresponding CTE. Once a user executes a query, they can compare the output to an expected result. In addition to an equality assertion, we extended our framework to support all expectations from the open-source Great Expectations library to provide more granular assertions and error messaging.
The main downside to this framework is that it requires a roundtrip to the query engine to construct the test data model given a set of inputs. Even though the query itself is lightweight and processes only a handful of rows, these roundtrips to the engine add up. It becomes costly to run an entire test suite on each local or CI run. To solve this, we introduced tooling both in development and CI to run the minimal set of tests that could potentially break given the change. This was straightforward to implement with accuracy because of dbt’s lineage tracking support; we simply had to find all downstream models (direct and indirect) for each changed model and run their tests.
Schema and DAG Validation on the Cheap
Our objective in Seamster’s CI is to give data scientists peace of mind that their changes won’t introduce production errors the next time the warehouse is built. They shouldn’t have to wonder whether removing a column will cause downstream dependencies to break, or whether they made a small typo in their SQL model definition.
To achieve this accurately, we needed to build and tear down the entire warehouse on every commit. This isn’t feasible from both a time and cost perspective. Instead, on every commit we materialize every model as a view in a temporary BigQuery dataset which is created at the start of the validation process and removed as soon as the validation finishes. If we can’t build a view because its upstream model doesn’t provide a certain column, or if the SQL is invalid for any reason, BigQuery fails to build the view and produces relevant error messaging.
Currently, this validation step takes about two minutes. We reduce validation time further by only building the portion of the DAG affected by the changed models, as done in the unit testing approach.
dbt’s schema.yml serves purely as metadata and can contain columns with invalid names or types (data_type
). We employ the same view-based strategy to validate the contents of a model’s schema.yml file ensuring the schema.yml is an accurate depiction of the actual SQL model.
Data Warehouse Rules
Like many large organizations, we maintain a data warehouse for reporting where accuracy is key. To power our independent data science teams, Seamster helps by enforcing conformance rules on the layers mentioned earlier (base, application-ready, and presentation layers). Examples include naming rules or inheritance rules which help the user reason over the data when building their own dependent models.
Seamster CI runs a collection of such rules that ensure consistency of documentation and modelling practices across different data science teams. For example, one warehouse rule enforces that all columns in a schema conform to a prescribed nomenclature. Another warehouse rule enforces that only base models can reference raw sources (via the ‘source’ macro) directly.
Some warehouse rules apply only to certain layers. In the presentation layer, we enforce that any column name needs a globally unique description to avoid divergence of definitions. Since everything in dbt is YAML, most of this rule enforcement is just simple parsing.
So, How Did It Go?
To ensure we got it right and worked out the kinks, we ran a multiweek beta of Seamster with some of our data scientists who tested the system out on real models. Since you’re reading about it, you can guess by now that it went well!
While productivity measures are always hard, the vast majority of users reported they were shipping models in a couple of days instead of a couple of weeks. In addition, documentation of models increased due to the fact that this is a feature built into the model spec.
Were there any negative results? Well, there's always going to be things we can improve. dbt’s current incremental support doesn’t provide safe and consistent methods to handle late arriving data, key resolution, and rebuilds. For this reason, a handful of models (like type 2 dimensions or models in the over 1.5 billion event territory) that required incremental semantics weren’t doable—for now. We’ve got big plans to address this in the future.
Where to Next?
We’re focusing on updating the tool to ensure it’s tailored to Shopify data scientists. The biggest hurdle for a new product (internal and external) is adoption. We know we still have work to do to ensure that our tool is top of mind when users have simple (but not easy) reporting work. We’re spending time with each team to identify upcoming work that we can speed up by using Seamster. Their questions and comments will be part of our tutorials and documentations for new users.
On the engineering front, an exciting next step is looking beyond batch data processing. Apache Beam and Beam SQL provide an opportunity to consider a single SQL-centric data modelling tool for both batch and streaming use cases.
We’re also big believers in open source at Shopify. Depending on the dbt’s community needs, we’d also like to explore contributing our validation strategy and a unit testing framework to the project.
If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.