Recently we launched Shopify Balance, a money management account and card that gives Shopify merchants quick access to their funds with no fees. After the beta launch of Shopify Balance, the Shopify Data team was brought in to answer the question: how do we reliably count the number of merchants using Balance? In particular, how do we count this historically?
While this sounds like a simple question, answering it is foundational to knowing whether our product is a success and whether merchants are actually using it. It's also more complicated to answer than it seems.
To be counted as using Shopify Balance, a merchant has to have both an active Shopify Balance account and an active Shopify account. This means we needed to build something that tracks the state changes of both accounts simultaneously, and makes that tracking robust and reliable over time. Enter double entry transition tables. While very much an "invest up front and save a ton of time in the long run" strategy, double entry transition tables give us the flexibility to see the individual inputs that cause a given change, all while simplifying our queries and reducing long-term maintenance on our reporting.
In this post, we’ll explore how we built a data pipeline using double entry transition tables to answer our question: how many Shopify merchants are using Shopify Balance? We’ll go over how we designed something that scales as our product grows in complexity, the benefits of using double entry transition tables—from ease of use to future proofing our reporting—and some sample queries using our new table.
What Are Double Entry Transition Tables?
Double entry transition tables are essentially a data presentation format that tracks changes in the attributes of entities over time. At Shopify, one of our first uses of a double entry transition table was tracking the state of merchants on the platform, allowing us to report on how many merchants have active accounts. In comparison to a standard transition table that has `from` and `to` columns, double entry transition tables output two rows for each state change, along with a new `net_change` column. They can also combine many individual tracked attributes into a single output.
It took me a long time to wrap my head around this `net_change` column, but it essentially works like this: if you want to track the status of something over time, every time the status changes from one state to another, there will be two entries:

- `net_change` = -1: this row is the previous state
- `net_change` = +1: this row is the new state

For example, if a merchant's account goes from `inactive` to `active`, the table gets one row for `inactive` with a `net_change` of -1 and one row for `active` with a `net_change` of +1.
Double entry transition tables have many advantages, including:

- The `net_change` column is additive: this is the true benefit of using this type of table. It allows you to quickly get the number of entities that are in a certain state by summing up `net_change` while filtering for the state you care about.
- Identifying the cause of change: for situations where you care about an overall status (one that depends on several underlying statuses), you can go into the table and see which of the individual attributes caused the change.
- Preserving all timing information: the output preserves all timing information, and even correctly orders transitions that have identical timestamps. This is helpful for situations where you need to know something like the duration of a given status.
- Easily scaling with additional attributes: if the downstream dependencies are written correctly, you can add additional attributes to your table as the product you're tracking grows in complexity. The bonus is that you don't have to rewrite any existing SQL or PySpark, all thanks to the additive nature of the `net_change` column.
For our purpose of identifying how many merchants are using Shopify Balance, double entry transition tables allow us to track state changes for both the Shopify Balance account and the Shopify account in a single table. It also gives us a clean way to query the status of each entity over time. But how do we do this?
Building Our Double Entry Transition Pipelines
First, we need to prepare individual attribute tables to be used as inputs for our double entry transition data infrastructure. We need at least one attribute, but it can scale to any number of attributes as the product we’re tracking grows.
In our case, we created individual attribute tables for both the Shopify Balance account status and the Shopify account status. An attribute input table must have a specific set of columns:
- a partition key that's common across attributes, which in our case is an `account_id`
- a sort key, generally a `transition_at` timestamp and an `index`
- an attribute you want to track
Using a standard transition table, we can convert it to an attribute table with a simple PySpark job.
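The sketch below illustrates that conversion. The source table and its `transition_id`, `account_id`, `transition_at`, and `to_status` columns are assumptions for illustration, not our actual schema:

```python
from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.getOrCreate()

# Assumed source: a standard transition table with a to_status column.
balance_transitions = spark.table("shopify_balance.account_status_transitions")

# Tiebreak any duplicate (account_id, transition_at) pairs by transition_id.
dedupe_window = (
    Window.partitionBy("account_id", "transition_at").orderBy("transition_id")
)

balance_status_attribute = balance_transitions.select(
    "account_id",                                        # partition key, common across attributes
    "transition_at",                                     # sort key: timestamp
    F.row_number().over(dedupe_window).alias("index"),   # sort key: tiebreak
    F.col("to_status").alias("balance_status"),          # the attribute we want to track
)
```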
Note the `index` column. We created this index using a row number window function, ordering by the `transition_id` any time we have duplicate `account_id` and `transition_at` sets in our original data. While simple, it serves as a tiebreak should there be two transition events with identical timestamps. This ensures we always have a unique (`account_id`, `transition_at`, `index`) set in our attribute for correct ordering of events. The `index` plays a key role later on when we create our double entry transition table, ensuring we're able to capture the order of our two states.
Now that we have our two attribute tables, it's time to feed them into our double entry transition pipelines. This system (called `build_merge_state_transitions`) takes our individual attribute tables and first generates a combined set of unique rows using a `partition_key` (in our case, the `account_id` column) and a `sort_key` (in our case, the `transition_at` and `index` columns). It then creates one column per attribute, and fills in the attribute columns with values from their respective tables, in the order defined by the `partition_key` and `sort_key`. Where values are missing, it fills in the table using the previous known value for that attribute.
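Below is a minimal PySpark sketch of two example attributes being merged together and filled in. It reuses the assumed attribute tables from above and illustrates the logic rather than our internal implementation:

```python
from pyspark.sql import functions as F, Window

# Give both attribute tables the same shape so they can be unioned,
# with nulls for the attribute each table doesn't carry.
shopify = shopify_status_attribute.withColumn(
    "balance_status", F.lit(None).cast("string")
)
balance = balance_status_attribute.withColumn(
    "shopify_status", F.lit(None).cast("string")
)

# One combined set of rows across both attributes.
combined = shopify.unionByName(balance)

# Forward-fill each attribute with its previous known value,
# ordered by the sort key within each account.
fill_window = (
    Window.partitionBy("account_id")
    .orderBy("transition_at", "index")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

merged = combined.select(
    "account_id",
    "transition_at",
    "index",
    F.last("shopify_status", ignorenulls=True).over(fill_window).alias("shopify_status"),
    F.last("balance_status", ignorenulls=True).over(fill_window).alias("balance_status"),
)
```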
This table is then run through another process that creates our `net_change` column and assigns a +1 value to all current rows. It also inserts a second row for each state change with a `net_change` value of -1. This `net_change` column now represents the direction of each state change, as outlined earlier.
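Conceptually, that step looks something like the sketch below. It builds on the `merged` DataFrame from the previous sketch and is again an illustration, not our internal pipeline code:

```python
from pyspark.sql import functions as F, Window

order_window = Window.partitionBy("account_id").orderBy("transition_at", "index")

# Pair each row with the state it replaced.
with_previous = merged.select(
    "*",
    F.lag("shopify_status").over(order_window).alias("prev_shopify_status"),
    F.lag("balance_status").over(order_window).alias("prev_balance_status"),
)

# +1 rows: the new state at each transition.
new_state = with_previous.select(
    "account_id", "transition_at", "index",
    "shopify_status", "balance_status",
    F.lit(1).alias("net_change"),
)

# -1 rows: the previous state, emitted at the same transition time.
previous_state = with_previous.where(
    F.col("prev_shopify_status").isNotNull()
    | F.col("prev_balance_status").isNotNull()
).select(
    "account_id", "transition_at", "index",
    F.col("prev_shopify_status").alias("shopify_status"),
    F.col("prev_balance_status").alias("balance_status"),
    F.lit(-1).alias("net_change"),
)

double_entry = new_state.unionByName(previous_state)
```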
Thanks to our pipeline, setting up a double entry transition table is a very simple PySpark job.
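In spirit, the job looks something like this; `build_merge_state_transitions` here is a stand-in for our internal pipeline's entry point, and its exact signature is assumed for illustration:

```python
# Hypothetical invocation of the double entry transition pipeline.
account_transition_facts = build_merge_state_transitions(
    attributes={
        "shopify_status": shopify_status_attribute,
        "balance_status": balance_status_attribute,
    },
    partition_key="account_id",
    sort_keys=["transition_at", "index"],
    # Default values fill the initial nulls before an attribute's
    # first recorded transition.
    defaults={
        "shopify_status": "inactive",
        "balance_status": "not_on_balance",
    },
)
```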
Note in the code above we've specified default values. These are used to fill in the initial null values for the attributes. Below is the output of our final double entry transition table, which we call our `account_transition_facts` table. The table captures both a merchant's Shopify and Shopify Balance account statuses over time. Looking at the `shopify_status` column, we can see they went from `inactive` to `active` in 2018, while the `balance_status` column shows us that they went from `not_on_balance` to `active` on March 14, 2021, and subsequently from `active` to `inactive` on April 23, 2021.
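A simplified reconstruction of that output is shown below (the `account_id` and the exact 2018 date are illustrative):

| account_id | transition_at | net_change | shopify_status | balance_status |
|------------|---------------|-----------:|----------------|----------------|
| 123 | 2018-05-01 | -1 | inactive | not_on_balance |
| 123 | 2018-05-01 | +1 | active | not_on_balance |
| 123 | 2021-03-14 | -1 | active | not_on_balance |
| 123 | 2021-03-14 | +1 | active | active |
| 123 | 2021-04-23 | -1 | active | active |
| 123 | 2021-04-23 | +1 | active | inactive |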
Using Double Entry Transition Tables
Remember how I mentioned that the `net_change` column is additive? This makes working with double entry transition tables incredibly easy. The ability to sum the `net_change` column significantly reduces the SQL or PySpark needed to get counts of states. For example, using our new `account_transition_facts` table, we can identify the total number of `active` accounts on Shopify Balance, using both the Shopify Balance status and Shopify status. All we have to do is sum our `net_change` column while filtering for the attribute statuses we care about.
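In PySpark, that's a single filtered aggregation; this sketch assumes the table is registered as `account_transition_facts` and `spark` is an active session:

```python
from pyspark.sql import functions as F

active_balance_accounts = (
    spark.table("account_transition_facts")
    # Keep only rows where both statuses are active...
    .where(
        (F.col("shopify_status") == "active")
        & (F.col("balance_status") == "active")
    )
    # ...then sum the additive net_change column.
    .agg(F.sum("net_change").alias("total_active_accounts"))
)
```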
Add in a grouping on a date column and we can see the net change in accounts over time.
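For example, under the same assumptions as above:

```python
net_change_by_day = (
    spark.table("account_transition_facts")
    .where(
        (F.col("shopify_status") == "active")
        & (F.col("balance_status") == "active")
    )
    # Group the double entries by the calendar day of the transition.
    .groupBy(F.to_date("transition_at").alias("transition_date"))
    .agg(F.sum("net_change").alias("net_account_change"))
    .orderBy("transition_date")
)
```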
We can even use the output in other PySpark jobs. Below is an example of a PySpark job consuming the output of our `account_transition_facts` table. In this case, we are adding the daily net change in account numbers to an aggregate daily snapshot table for Shopify Balance.
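A sketch of that job follows; the snapshot table name and its `snapshot_date` column are assumptions for illustration:

```python
from pyspark.sql import functions as F

# Hypothetical aggregate daily snapshot table for Shopify Balance.
daily_snapshot = spark.table("shopify_balance.daily_snapshot")

# Daily net change in accounts that are active on both statuses.
daily_net_change = (
    spark.table("account_transition_facts")
    .where(
        (F.col("shopify_status") == "active")
        & (F.col("balance_status") == "active")
    )
    .groupBy(F.to_date("transition_at").alias("snapshot_date"))
    .agg(F.sum("net_change").alias("net_account_change"))
)

# Left join so snapshot days with no transitions keep a zero net change.
enriched_snapshot = (
    daily_snapshot
    .join(daily_net_change, on="snapshot_date", how="left")
    .fillna({"net_account_change": 0})
)
```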
There are many ways you can achieve the same outputs using SQL or PySpark, but having a double entry transition table in place significantly simplifies the code at query time. And as mentioned earlier, if you write the code using the additive `net_change` column, you won't need to rewrite any SQL or PySpark when you add more attributes to your double entry transition table.
We won't lie, it took a lot of time and effort to build the first version of our `account_transition_facts` table. But thanks to our investment, we now have a reliable way to answer our initial question: how do we count the number of merchants using Balance? It's easy with our double entry transition table! Filtering for the statuses we care about, we simply sum `net_change` and voilà, we have our answer.
Not only does our double entry transition table simply and elegantly answer our question, but it also easily scales with our product. Thanks to the additive nature of the `net_change` column, we can add additional attributes without impacting any of our existing reporting. This means this is just the beginning for our `account_transition_facts` table. In the coming months, we'll be evaluating other statuses that change over time, and adding in those that make sense for Shopify Balance into our table. Next time you need to reliably count multiple states, try exploring double entry transition tables.
Justin Pauley is a Data Scientist working on Shopify Balance. Justin has a passion for solving complex problems through data modeling, and is a firm believer that clean data leads to better storytelling. In his spare time he enjoys woodworking, building custom Lego creations, and learning new technologies. Justin can be reached via LinkedIn.
Are you passionate about data discovery and eager to learn more? We're always hiring! Reach out to us or apply on our careers page.
If you’re interested in building solutions from the ground up and would like to come work with us, please check out Shopify’s career page.