Application databases are generally designed to track only current state. For example, a typical `users` data model stores the current settings for each user. This is known as a Type 1 dimension. Each time a user makes a change, their corresponding record is updated in place:
| id | feature_x_enabled | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | true | 2019-01-01 12:14:23 | 2019-01-01 12:14:23 |
| 2 | false | 2019-01-01 15:21:45 | 2019-01-02 05:20:00 |
This makes a lot of sense for applications. They need to be able to rapidly retrieve settings for a given user in order to determine how the application behaves. An indexed table at the user grain accomplishes this well.
But, as analysts, we care not only about the current state (how many users are using feature “X” as of today), but also the historical state. How many users were using feature “X” 90 days ago? What is the feature’s 30-day retention rate? How often are users turning it off and on? To answer these questions, we need a data model that tracks historical state:
| id | feature_x_enabled | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- |
| 1 | true | 2019-01-01 12:14:23 | | true |
| 2 | true | 2019-01-01 15:21:45 | 2019-01-02 05:20:00 | false |
| 2 | false | 2019-01-02 05:20:00 | | true |
This is known as a Type 2 dimensional model. I’ll show how you can create these data models using modern ETL tooling like PySpark and dbt (data build tool).
Implementing Type 2 Dimensional Models at Shopify
I currently work as a data scientist in the International product line at Shopify. Our product line is focused on adapting and scaling our product around the world. One of the first major efforts we undertook was translating Shopify’s admin in order to make our software available to use in multiple languages.
*The Shopify admin, translated*
At Shopify, data scientists work across the full stack—from data extraction and instrumentation, to data modelling, dashboards, analytics, and machine learning powered products. As a product data scientist, I’m responsible for understanding how our translated versions of the product are performing. How many users are adopting them? How is adoption changing over time? Are they retaining the new language, or switching back to English? If we default a new user from Japan into Japanese, are they more likely to become a successful merchant than if they were first exposed to the product in English and given the option to switch? In order to answer these questions, we first had to figure out how our data could be sourced or instrumented, and then eventually modelled.
The functionality that decides which language to render Shopify in is based on the `language` setting our engineers added to the `users` data model.
| id | language | … | created_at | updated_at |
| --- | --- | --- | --- | --- |
| 1 | en | | 2019-01-01 12:14:23 | 2019-06-01 07:15:03 |
| 2 | ja | | 2019-02-02 11:00:35 | 2019-02-02 11:00:35 |
| … | … | … | … | … |
User 1 will experience the Shopify admin in English, User 2 in Japanese, and so on. Like most data models powering Shopify’s software, the `users` model is a Type 1 dimension. Each time a user changes their language, or any other setting, the record gets updated in place. As I alluded to above, this data model doesn’t allow us to answer many of our questions, since they involve knowing what language a given user was using at a particular point in time. Instead, we needed a data model that tracked users’ languages over time. There are several ways to approach this problem.
Options For Tracking State
Modify Core Application Model Design
In an ideal world, the core application database model will be designed to track state. Rather than having a record be updated in place, the new settings are instead appended as a new record. Due to the fact that the data is tracked directly in the source of truth, you can fully trust its accuracy. If you’re working closely with engineers prior to the launch of a product or new feature, you can advocate for this data model design. However, you will often run into two challenges with this approach:
- Engineers will be very reluctant to change the data model design to support analytical use cases. They want the application to be as performant as possible (as should you), and having a data model which keeps all historical state is not conducive to that.
- Most of the time, new features or products are built on top of pre-existing data models. As a result, modifying an existing table design to track history will come with an expensive and risky migration process, along with the aforementioned performance concerns.
In the case of rendering languages for the Shopify admin, the `language` field was added to the pre-existing `users` model, and updating this model’s design was out of the question.
Stitch Together Database Snapshots
At most technology companies, snapshots of application database tables are extracted into the data warehouse or data lake. At Shopify, we have a system that extracts newly created or updated records from the application databases on a fixed schedule.
These snapshots can be leveraged as an input source for building a Type 2 dimension. However, given the fixed-schedule nature of the data extraction system, it is possible that you will miss updates happening between one extract and the next.
If you are using dbt for your data modelling, you can leverage their nice built-in solution for building Type 2’s from snapshots!
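If you go that route, a snapshot definition could look roughly like the sketch below. The source reference and schema names are assumptions about how the project is configured; dbt adds `dbt_valid_from` and `dbt_valid_to` columns to the snapshot table for you.

```sql
{% snapshot users_snapshot %}

{{
    config(
        target_schema='snapshots',
        unique_key='id',
        strategy='timestamp',
        updated_at='updated_at'
    )
}}

-- Snapshot the raw users table on every dbt run;
-- dbt records changed rows as Type 2 history.
select * from {{ source('shopify', 'users') }}

{% endsnapshot %}
```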
Add Database Event Logging
Another alternative is to add a new event log. Each newly created or updated record is stored in this log. At Shopify, we rely heavily on Kafka as a pipeline for transferring real-time data between our applications and data land, which makes it an ideal candidate for implementing such a log.
If you work closely with engineers, or are comfortable working in your application codebase, you can get new logging in place that will stream any new or updated record to Kafka. Shopify is built on the Ruby on Rails web framework. Rails has something called “Active Record Callbacks”, which allows you to trigger logic before or after an alteration of an object’s (read “database record’s”) state. For our use case, we can leverage the `after_commit` callback to log a record to Kafka after it has been successfully created or updated in the application database.
While this option isn’t perfect, and comes with a host of other caveats I will discuss later, we ended up choosing it as it was the quickest and easiest solution to implement that provided the required granularity.
Type 2 Modelling Recipes
Below, I’ll walk through some recipes for building Type 2 dimensions from the event logging option discussed above. We’ll stick with our example of modelling users’ languages over time and work with the case where we’ve added event logging to our database model from day 1 (i.e., when the table was first created). Here’s an example of what our `user_update` event log would look like:
| id | language | created_at | updated_at |
| --- | --- | --- | --- |
| 1 | en | 2019-01-01 12:14:23 | 2019-01-01 12:14:23 |
| 2 | en | 2019-02-02 11:00:35 | 2019-02-02 11:00:35 |
| 2 | fr | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 |
| 2 | fr | 2019-02-02 11:00:35 | 2019-02-02 13:01:17 |
| 2 | en | 2019-02-02 11:00:35 | 2019-02-02 14:10:01 |
This log describes the full history of the `users` data model:

- User 1 gets created at `2019-01-01 12:14:23` with English as the default language.
- User 2 gets created at `2019-02-02 11:00:35` with English as the default language.
- User 2 decides to switch to French at `2019-02-02 12:15:06`.
- User 2 changes some other setting that is tracked in the `users` model at `2019-02-02 13:01:17`.
- User 2 decides to switch back to English at `2019-02-02 14:10:01`.
Our goal is to transform this event log into a Type 2 dimension that looks like this:
| id | language | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- |
| 1 | en | 2019-01-01 12:14:23 | | true |
| 2 | en | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 | false |
| 2 | fr | 2019-02-02 12:15:06 | 2019-02-02 14:10:01 | false |
| 2 | en | 2019-02-02 14:10:01 | | true |
We can see that the current state for all users can easily be retrieved with a SQL query that filters for `WHERE is_current`. These records also have a `null` value for the `valid_to` column, since they are still in use. However, it is common practice to fill these nulls with something like the timestamp at which the job last ran, since the actual values may have changed since then.
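For example, assuming the model above is materialized as a table named `dim_user_languages` (an illustrative name), those two patterns look roughly like this:

```sql
-- Current language for every user.
select id, language
from dim_user_languages
where is_current;

-- Fill open-ended records with the time the job last ran instead of null.
select
    id,
    language,
    valid_from,
    coalesce(valid_to, timestamp '2020-05-23 00:56:49') as valid_to
from dim_user_languages;
```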
PySpark
Due to Spark’s ability to scale to massive datasets, we use it at Shopify to build the data models that get loaded into our data warehouse. To avoid the mess that comes with installing Spark on your own machine, you can leverage a pre-built Docker image with PySpark and Jupyter Notebook pre-installed. If you want to play around with these examples yourself, pull down the image with `docker pull jupyter/pyspark-notebook:c76996e26e48` and then run `docker run -p 8888:8888 jupyter/pyspark-notebook:c76996e26e48` to spin up a notebook where you can run PySpark locally.
We’ll start with some boilerplate code to create a Spark DataFrame containing our sample of user update events:
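The snippet below is a minimal sketch of that boilerplate; the DataFrame name `user_update_events` and the string-to-timestamp casting are my own choices.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("type_2_languages").getOrCreate()

# Sample of user_update events, mirroring the event log shown earlier.
user_update_events = (
    spark.createDataFrame(
        [
            (1, "en", "2019-01-01 12:14:23", "2019-01-01 12:14:23"),
            (2, "en", "2019-02-02 11:00:35", "2019-02-02 11:00:35"),
            (2, "fr", "2019-02-02 11:00:35", "2019-02-02 12:15:06"),
            (2, "fr", "2019-02-02 11:00:35", "2019-02-02 13:01:17"),
            (2, "en", "2019-02-02 11:00:35", "2019-02-02 14:10:01"),
        ],
        ["id", "language", "created_at", "updated_at"],
    )
    .withColumn("created_at", F.to_timestamp("created_at"))
    .withColumn("updated_at", F.to_timestamp("updated_at"))
)
```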
With that out of the way, the first step is to filter our input log to only include records where the columns of interest were updated. With our event instrumentation, we log an event whenever any record in the `users` model is updated. For our use case, we only care about instances where the user’s `language` was updated (or created for the first time). It’s also possible that you will get duplicate records in your event logs, since Kafka clients typically support “at-least-once” delivery. The code below will also filter out these cases:
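Here is a sketch of that filtering, building on the `user_update_events` DataFrame from the previous block: a lag window detects language changes, and `dropDuplicates` removes exact duplicates from at-least-once delivery.

```python
from pyspark.sql import Window

id_window = Window.partitionBy("id").orderBy("updated_at")

language_changes = (
    user_update_events
    # Remove exact duplicates caused by at-least-once Kafka delivery.
    .dropDuplicates(["id", "language", "updated_at"])
    # Keep only rows where the language actually changed (or the user was first created).
    .withColumn("previous_language", F.lag("language").over(id_window))
    .where(
        F.col("previous_language").isNull()
        | (F.col("language") != F.col("previous_language"))
    )
    .select("id", "language", "updated_at")
)
```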
We now have something that looks like this:
| id | language | updated_at |
| --- | --- | --- |
| 1 | en | 2019-01-01 12:14:23 |
| 2 | en | 2019-02-02 11:00:35 |
| 2 | fr | 2019-02-02 12:15:06 |
| 2 | en | 2019-02-02 14:10:01 |
The last step is fairly simple; we produce one record per period for which a given language was enabled:
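One way to express this step (a sketch reusing `language_changes` and `id_window` from above) is a lead window over each user’s change events, with open-ended periods filled with the job run time:

```python
user_languages = (
    language_changes
    # The next change for the same user closes out the current period.
    .withColumn("valid_to", F.lead("updated_at").over(id_window))
    .withColumnRenamed("updated_at", "valid_from")
    .withColumn("is_current", F.col("valid_to").isNull())
    # Fill open-ended periods with the job run time, as in the output below.
    .withColumn("valid_to", F.coalesce(F.col("valid_to"), F.current_timestamp()))
)
```

The result is our Type 2 dimension: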
| id | language | valid_from | valid_to | is_current |
| --- | --- | --- | --- | --- |
| 1 | en | 2019-01-01 12:14:23 | 2020-05-23 00:56:49 | true |
| 2 | en | 2019-02-02 11:00:35 | 2019-02-02 12:15:06 | false |
| 2 | fr | 2019-02-02 12:15:06 | 2019-02-02 14:10:01 | false |
| 2 | en | 2019-02-02 14:10:01 | 2020-05-23 00:56:49 | true |
dbt
dbt is an open source tool that lets you build new data models in pure SQL. It’s a tool we are currently exploring at Shopify to supplement modelling in PySpark, which I am really excited about. When writing PySpark jobs, you’re typically composing SQL in your head and then figuring out how to translate it to the PySpark API. Why not just build the models in pure SQL? dbt lets you do exactly that:
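A dbt model mirroring the PySpark steps above could look roughly like the following sketch; the `{{ source(...) }}` reference and the model name are assumptions about project setup.

```sql
-- models/dim_user_languages.sql
with user_update_events as (

    select * from {{ source('shopify', 'user_update') }}

),

language_changes as (

    -- Keep only rows where the language changed (or the user was first created);
    -- distinct also removes exact duplicates from at-least-once delivery.
    select distinct
        id,
        language,
        updated_at
    from (
        select
            *,
            lag(language) over (partition by id order by updated_at) as previous_language
        from user_update_events
    ) as events_with_previous
    where previous_language is null
       or language <> previous_language

),

language_periods as (

    select
        id,
        language,
        updated_at as valid_from,
        lead(updated_at) over (partition by id order by updated_at) as valid_to
    from language_changes

)

select
    id,
    language,
    valid_from,
    coalesce(valid_to, current_timestamp) as valid_to,
    valid_to is null as is_current
from language_periods
```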
With this SQL, we have replicated the same steps performed in the PySpark example, and it produces the same output shown above.
Gotchas, Lessons Learned, and The Path Forward
I’ve leveraged the approaches outlined above with multiple data models now. Here are a few of the things I’ve learned along the way.
1. It took us a few tries before we landed on the approach outlined above.
In some initial implementations, we were logging the record changes before they had been successfully committed to the database, which resulted in some mismatches in the downstream Type 2 models. Since then, we’ve been sure to always leverage the `after_commit` callback-based approach.
2. There are some pitfalls with logging changes from within the code:
- Your event logging becomes susceptible to future code changes. For example, an engineer refactors some code and removes the `after_commit` call. These cases are rare, but they can happen. A good safeguard against this is to leverage tooling like the `CODEOWNERS` file, which notifies you when a particular part of the codebase is being changed.
- You may miss record updates that are not triggered from within the application code. Again, these are rare, but it is possible for an external process that is not using the Rails `User` model to make changes to records in the database.
3. It is possible to lose some events in the Kafka process.
For example, if one of the Shopify servers running the Ruby code were to fail before the event was successfully emitted to Kafka, you would lose that update event. Same thing if Kafka itself were to go down. Again, rare, but nonetheless something you should be willing to live with. There are a few ways you can mitigate the impact of these events:
- Have some continuous data quality checks running that compare the Type 2 dimensional model against the current state and check for discrepancies (see the sketch after this list).
- If and when any discrepancies are detected, you could augment your event log using the current state snapshot.
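A minimal sketch of such a check, with illustrative table names: flag users whose current Type 2 record disagrees with the latest snapshot of the Type 1 `users` table.

```sql
select
    snap.id,
    snap.language as snapshot_language,
    dim.language  as type_2_language
from users_snapshot as snap
join dim_user_languages as dim
    on dim.id = snap.id
    and dim.is_current
where snap.language <> dim.language
```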
4. If deletes occur in a particular data model, you need to implement a way to handle this.
Otherwise, delete events will be indistinguishable from normal create or update records with the logging setup shown above. Here are some ways around this:
- Have your engineers modify the table design to use soft deletes instead of hard deletes.
- Add a new field to your Kafka schema and log the type of event that triggered the change (i.e., `create`, `update`, or `delete`), and then handle it accordingly in your Type 2 model code; a sketch follows this list.
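Taking the second approach, the PySpark recipe above only needs a small tweak. This is a sketch that assumes the filtered change events (here called `language_changes_with_operation`, a hypothetical name) carry an `operation` column with the event type:

```python
from pyspark.sql import Window
from pyspark.sql import functions as F

id_window = Window.partitionBy("id").orderBy("updated_at")

user_languages_with_deletes = (
    language_changes_with_operation  # assumed: change events plus an `operation` column
    .withColumn("valid_to", F.lead("updated_at").over(id_window))
    # A delete closes the previous period (via the lead above) but shouldn't open a new one.
    .where(F.col("operation") != "delete")
    .withColumnRenamed("updated_at", "valid_from")
    .withColumn("is_current", F.col("valid_to").isNull())
)
```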
Implementing Type 2 dimensional models for Shopify’s admin languages was truly an iterative process and took investment from both data and engineering to successfully implement. With that said, we have found the analytical value of the resulting Type 2 models well worth the upfront effort.
Looking ahead, there’s an ongoing project at Shopify by one of our data engineering teams to store the MySQL binary logs (binlogs) in data land. Binlogs are a much better source for a log of data modifications, as they are directly tied to the source of truth (the MySQL database), and are much less susceptible to data loss than the Kafka based approach. With binlog extractions in place, you don’t need to add separate Kafka event logging to every new model as changes will be automatically tracked for all tables. You don’t need to worry about code changes or other processes making updates to the data model since the binlogs will always reflect the changes made to each table. I am optimistic that with binlogs as a new, more promising source for logging data modifications, along with the recipes outlined above, we can produce Type 2s out of the box for all new models. Everybody gets a Type 2!
Additional Information
SQL Query Recipes
Once we have our data modelled as a Type 2 dimension, there are a number of questions we can start easily answering:
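For example, with the `dim_user_languages` model sketched earlier, point-in-time and current-state questions become simple filters:

```sql
-- What language was each user using at a particular point in time?
select id, language
from dim_user_languages
where valid_from <= timestamp '2019-02-02 13:00:00'
  and (valid_to is null or valid_to > timestamp '2019-02-02 13:00:00');

-- How many users are currently using each language?
select language, count(*) as num_users
from dim_user_languages
where is_current
group by language;
```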
Are you passionate about data discovery and eager to learn more? We’re always hiring! Reach out to us or apply on our careers page.