The Complex Data Models Behind Shopify's Tax Insights Feature

A business’s taxes can be difficult to manage, especially in the United States, where tax laws are complicated and vary state-to-state, city-to-city, and product-to-product.

When merchants start their businesses or as their businesses grow, it’s a challenge to figure out when they’re liable for sales tax—or when they have nexus. In 2018, the U.S. Supreme Court overturned a longstanding precedent that required retailers to collect and remit sales tax only if they had a physical presence in a state. Now, most states require out-of-state businesses to collect sales tax once they hit certain thresholds.

Shopify Tax provides state-level tax information to each merchant within their store’s admin page and informs them when they become liable, or are approaching liability, to collect sales taxes in each state where they have sales. Ultimately, it saves merchants time spent worrying about taxes and audits so they can focus on growing their businesses.

Since we launched the Tax Insights feature as part of Shopify Tax, it has helped merchants stay more on top of their tax compliance than ever before.

The entire product entailed intensive data work behind the scenes. It included modifying several existing data models, creating four new ones, building in functionality to handle dynamically changing data, and publishing insights to a key-value store that subsequently gets surfaced to the end user. 

As it turns out, surfacing a few extra data points to an end user can have a significant positive impact. But what exactly did this entail? Is it just a simple number-crunching Spark job that queries a table containing hard-coded tax laws, does some calculations, and spits out data in your UI? What if you also want to use this data to create a machine learning model, or for analytics to further drive business impact? This post will explain the work involved in achieving this impact, from complex data modeling to sending data to the application’s backend. We’ll walk through each step: planning and gathering data requirements, prototyping our data models with SQL, implementing the logic behind several of the data models, productionizing the jobs with Spark, and publishing the insights to the merchant.

The Tax Insights Data Architecture

A high-level representation of the insights data architecture. 

Above is a high-level view of the data architecture for Tax Insights. The core application databases are extracted and loaded into our data warehouse, which sits in Google Cloud Storage (GCS). These tables are processed with Spark jobs to create a series of data models. The final model that contains end-user data is loaded into Google Cloud Bigtable, where it is queried by the core application and served to the frontend for the end user to see. Additionally, the data in GCS can also be consumed by our analytics platform, inform other business processes or features, and make its way to the backend and frontend of the core application.

Become a Tax Expert

Before we started writing code, we first needed to understand the data and requirements. Unfortunately, I don’t have an accounting degree, so for this project we had to consult our internal tax experts to get up to speed on the U.S. tax code. But it didn’t stop there. We needed to work with our other stakeholders in product, engineering, UX, and marketing to answer questions like:

  • How will this product benefit our merchants?
  • How do merchants stay on top of tax compliance and enter that info into our platform?
  • Which page(s) in the Shopify admin do merchants go to for this, and how often do they actually do it?
  • Why aren’t some merchants tax compliant today?
  • How are the database tables in the backend populated when merchants are collecting taxes, calculating taxes, and fulfilling orders?
  • What types of orders count towards tax liability, and what types do not?
  • How large is our addressable market?
  • How will we plan to announce this new feature and drive merchants through our adoption funnel?
  • What metrics do we need to track in order to measure success, and what kind of reporting do we need to create?

There are countless questions to ask and to answer when initially scoping a new product, and most of these will apply outside of taxes. Once we had a grasp on our goals and became familiar with the data, the real fun began!

Demonstrate Those SQL Skills

Now that we knew our aim, we planned out our next steps. When you’re dealing with multiple complex data models, it’s useful to write down every data model you think you’ll need, along with a timeline for each. You’ll likely want to break your work into intermediate datasets, since it may be necessary to use them individually. This is useful for analytics and also makes debugging easier. And when the time comes for implementation, you’ll likely write clearer, more modular code that is easier to test and understand, rather than one giant mega-job.

If you know that you’ll be querying your datasets from your enterprise data warehouse, it may be beneficial to first prototype them in SQL before implementing your data models in PySpark (or your framework of choice).

For this project, since so many upstream data sources already existed in Google Cloud Storage (GCS), it was much easier to query the tables directly with Google BigQuery, rather than to find the raw parquet file locations and create a bunch of Spark dataframes in a Python notebook.
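To give a flavor of this prototyping step, here’s a minimal sketch of what such a query might look like using the BigQuery Python client. The project, table, and column names are hypothetical, not Shopify’s actual schema:

```python
# A hypothetical prototype: monthly net taxable sales per merchant and state.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")  # hypothetical project

query = """
SELECT
  merchant_id,
  shipping_state,
  DATE_TRUNC(order_date, MONTH) AS sales_month,
  SUM(net_sales_amount) AS net_sales,
  COUNT(DISTINCT order_id) AS order_count
FROM `my-analytics-project.sales.orders`
WHERE country_code = 'US'
GROUP BY merchant_id, shipping_state, sales_month
"""

for row in client.query(query).result():
    print(row.merchant_id, row.shipping_state, row.sales_month, row.net_sales)
```

Iterating on queries like this is much faster than spinning up Spark jobs, and the final SQL doubles as a specification for the production implementation.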

Handling Complex Tax Laws and Dates

I mentioned early on that U.S. tax laws are complex and vary state-to-state. This extends beyond just simple tax rates per state. What does that mean exactly?

There are varying economic nexus thresholds, involving some combination of a dollar amount and/or a number of orders. In one state, you might need to have had greater than $250,000 in sales in the evaluation period, excluding digital orders and taxable marketplace orders (on Facebook Marketplace, as an example). But in another state, the threshold is crossed if you have done greater than or equal to $100,000 in sales AND greater than or equal to 200 sales in the time period. And there are many different combinations of these. But what is most notable is what constitutes the time period. There are multiple different economic nexus evaluation periods across the U.S. states:

  • Previous calendar year
  • Previous or current calendar year (that is, there are two separate periods to consider)
  • 12-month rolling window, up until the last complete month
  • Four preceding complete tax quarters, as is the case in New York

So, at any point in time, in any given U.S. state, if we want to evaluate a shop’s tax liability, we need to ask: over what timeframe(s) should sales be aggregated?
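To make this concrete, here’s a minimal sketch of how these evaluation windows might be computed. The period names mirror the list above, but the function itself is illustrative, not Shopify’s implementation:

```python
from datetime import date, timedelta

def evaluation_windows(period_type: str, as_of: date) -> list[tuple[date, date]]:
    """Return the (start, end) window(s) to aggregate sales over, as of a date.
    Illustrative only: the real period types and rules live in a look-up table."""
    if period_type == "previous_calendar_year":
        return [(date(as_of.year - 1, 1, 1), date(as_of.year - 1, 12, 31))]
    if period_type == "previous_or_current_calendar_year":
        # Two separate windows, each evaluated independently.
        return [
            (date(as_of.year - 1, 1, 1), date(as_of.year - 1, 12, 31)),
            (date(as_of.year, 1, 1), as_of),
        ]
    if period_type == "rolling_12_months":
        # 12-month window ending with the last complete month.
        end = date(as_of.year, as_of.month, 1) - timedelta(days=1)
        start = date(end.year, 1, 1) if end.month == 12 else date(end.year - 1, end.month + 1, 1)
        return [(start, end)]
    # "four_preceding_quarters" (as in New York) works similarly, but over the
    # state's sales tax quarters rather than calendar months; omitted for brevity.
    raise ValueError(f"unknown period type: {period_type}")

print(evaluation_windows("rolling_12_months", date(2023, 5, 10)))
# [(datetime.date(2022, 5, 1), datetime.date(2023, 4, 30))]
```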

And what happens if tax laws change? The time period, the sales threshold, or another rule could change in any given state as eCommerce laws are always evolving.

To handle this dynamic situation, we first created a look-up table of all the rules, with `valid_from` and `valid_to` dates to accommodate changing laws (a sketch of this idea follows below). We then used this table to create a new dimensional model that spits out the following data, at any given date, for any given U.S. state (as well as DC and Puerto Rico):

  • What is the start and end date of the time period(s) to evaluate sales in?
  • What is the sales threshold?
  • What is the orders threshold?
  • Are the thresholds inclusive or exclusive?
  • What types of orders are included and excluded?

This model made it much easier to aggregate each merchant’s state-level sales against the rule set, automatically handled dynamically changing data, and allowed us to continuously calculate economic nexus for each merchant—even when the New Year comes around.
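Here’s that sketch in PySpark. The schema, rule values, and date conventions (we treat `valid_to` as exclusive) are all illustrative, not the production model:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical rules look-up table; thresholds and dates are illustrative.
rules = spark.createDataFrame(
    [
        ("NY", 500000.0, 100, "and", "four_preceding_quarters", "2019-06-21", "9999-12-31"),
        ("CA", 500000.0, None, None, "previous_or_current_calendar_year", "2019-04-01", "9999-12-31"),
    ],
    ["state", "sales_threshold", "orders_threshold", "threshold_operator",
     "period_type", "valid_from", "valid_to"],
)

def rules_as_of(eval_date: str):
    """Return the rule in force in each state on eval_date (ISO strings compare correctly).
    When a law changes, we close out the old row's valid_to and add a new row."""
    return rules.where(
        (F.col("valid_from") <= F.lit(eval_date)) & (F.col("valid_to") > F.lit(eval_date))
    )

rules_as_of("2023-06-01").show()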

Economic Nexus Model

Another model we created aggregated total net taxable sales for each merchant, in each state, in each month (a minimal sketch of this aggregation follows). And by joining that model with the model described in the previous section, we have our last dataset: an actual evaluation of each merchant’s state-level tax liabilities over time.
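As with the other sketches, the dataframe and column names here are hypothetical stand-ins for the real warehouse tables:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical order-level input; the real input comes from the warehouse.
orders = spark.createDataFrame(
    [(1, "NY", "2023-04-02", 120.0, True), (1, "NY", "2023-04-20", 80.0, False)],
    ["merchant_id", "state", "order_date", "net_taxable_sales", "counts_toward_nexus"],
)

monthly_sales = (
    orders
    .where(F.col("counts_toward_nexus"))  # e.g. exclude marketplace orders where a state's rules say so
    .groupBy(
        "merchant_id",
        "state",
        F.date_trunc("month", F.to_timestamp("order_date")).alias("sales_month"),
    )
    .agg(
        F.sum("net_taxable_sales").alias("net_sales"),
        F.count("*").alias("order_count"),
    )
)
monthly_sales.show()
```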

The combined model creates a row at the merchant-state-month grain. In other words, for every specific merchant, in every state, for every month:

  • Over what time period(s) should sales and orders be evaluated?
  • Which types of orders and sales should be evaluated?
  • What are the sales and orders thresholds, if applicable?
  • How much in sales has the merchant done in said time period(s)?
  • How many orders has the merchant had in said time period(s)?
  • Does that state have an economic nexus law?
  • Did the merchant have a tax liability?
  • When did the merchant first reach that tax liability?
  • Is the merchant already compliant in that state, meaning they are collecting taxes and don’t need to be notified?
  • How up-to-date is this data?

And since this model gives us the economic nexus data over time, we simply need to take the latest snapshot and send that to our merchants to keep them informed of their sales tax obligations. Additionally, this model is the one that will help us answer our business questions, track product performance, and ultimately tell us if we are meeting our goals and benefitting our merchants.
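Putting the pieces together, the core of the liability evaluation boils down to comparing windowed totals against each state’s rule. The sketch below assumes inclusive thresholds and hypothetical values; in the real model, inclusivity and the order types to include are themselves fields resolved from the rules table:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Each merchant's totals over the state's evaluation window(s) (hypothetical values).
totals = spark.createDataFrame(
    [(1, "NY", 620000.0, 140), (1, "CA", 90000.0, 300)],
    ["merchant_id", "state", "net_sales", "order_count"],
)
# Per-state rules as resolved from the look-up table (see the earlier sketch).
rules = spark.createDataFrame(
    [("NY", 500000.0, 100, "and"), ("CA", 500000.0, None, None)],
    ["state", "sales_threshold", "orders_threshold", "threshold_operator"],
)

liability = totals.join(rules, "state").withColumn(
    "has_nexus",
    F.when(
        F.col("threshold_operator") == "and",  # both thresholds must be crossed
        (F.col("net_sales") >= F.col("sales_threshold"))
        & (F.col("order_count") >= F.col("orders_threshold")),
    ).otherwise(F.col("net_sales") >= F.col("sales_threshold")),
)
liability.select("merchant_id", "state", "has_nexus").show()
```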

Spark Up Those Engines

By this point in the process, we had opened some PRs on our SQL prototypes and received the okay from our peers. Now for the easy part—creating the data models. At Shopify, we have an internal data pipeline platform built on top of PySpark called Starscream. This platform runs tens of thousands of jobs and writes hundreds of terabytes of data per day.

We also have a newer framework called Seamster that leverages dbt (data build tool), which means we can skip the translation to PySpark and almost directly use the SQL prototypes, greatly speeding up shipping time. (Read more in this blog post about building a production-grade workflow with SQL modeling!) 

Translating SQL into PySpark is pretty straightforward, so I’ll skip those details. Starscream is nice because it has abstracted a lot of the nitty-gritty but commonly-used Spark code into reusable classes to handle everything from simple transformations to complex skew joins, while also allowing the flexibility to tune your resource management.
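Starscream’s classes are internal, so to illustrate just how mechanical the translation is, here’s a generic example in plain PySpark:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
monthly_sales = spark.createDataFrame(
    [(1, "NY", 1000.0), (2, "NY", 2500.0)], ["merchant_id", "state", "net_sales"]
)

# The SQL prototype
#   SELECT state, SUM(net_sales) AS net_sales FROM monthly_sales GROUP BY state
# maps almost one-to-one onto the DataFrame API:
state_totals = monthly_sales.groupBy("state").agg(F.sum("net_sales").alias("net_sales"))
state_totals.show()
```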

So, one-by-one, PR after PR and unit test after unit test, we shipped these data models, loaded them to our data warehouse, queried them in our favorite business intelligence tool, and saw the fruits of our labor.

Give the Engineers Some Data

Finally, with all our data models shipped and loaded to our data lake, the last step was to serve the final insights to our merchants. How did we do this?

Every time the job that calculates each merchant’s historical state-level tax liability runs, it kicks off another job, which takes the most up-to-date data and immediately loads it into a key-value store, Google Cloud Bigtable. We defined a schema whose key is a merchant’s unique identifier and whose value is the state-level tax liability insights. The backend of Shopify’s core application is then able to quickly fetch this data and serve it to the frontend for our merchants to see. Beautiful!
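For illustration, here’s a minimal sketch of writing one such key-value row with the Bigtable Python client. The project, instance, table, and column names are hypothetical, and Shopify’s actual publisher works differently at scale:

```python
import json

from google.cloud import bigtable

client = bigtable.Client(project="my-project")  # hypothetical project
table = client.instance("tax-insights").table("nexus_insights")  # hypothetical names

# Key: the merchant's unique identifier. Value: serialized state-level insights.
row = table.direct_row(b"merchant:12345")
row.set_cell(
    "insights",           # column family (assumed to already exist on the table)
    "state_liabilities",  # column qualifier
    json.dumps([{"state": "NY", "has_nexus": True}]).encode("utf-8"),
)
row.commit()
```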

With merchants able to see and act upon their U.S. tax liability information directly from their store’s tax settings page, and with the data also stored in our data warehouse, we can measure the positive impact on our merchants’ tax burden, learn additional insights, and move on to making other areas of sales tax as delightful as possible.

Shopify Tax Nexus Insights

Three Lessons Learned

Creating a user-facing product is exciting and rewarding, but it wasn’t without its challenges. Here are some learnings that we took away:

1. Always Include Some Buffer in Your Timelines

At more than one point, we were behind schedule for handing off the final dataset to our backend engineers, which resulted in some stressful crunch time. Data modeling is not always easy. And even if you’re used to creating datasets for internal use only, when there’s a customer-facing aspect, the complexity grows significantly. Extra time will often need to be spent ensuring that:

  • Nuances of the data are understood
  • All edge cases are caught and handled
  • The data is correct and reliable
  • The job(s) is optimized and will scale as your data grows

Be sure to plan for extra time so you can manage expectations and still meet deadlines.

2. Consult the Experts As Much as Possible

The U.S. tax code is incredibly nuanced, as I’m sure you’ve gathered by now. For those of us who aren’t tax experts, understanding the data was challenging, and we found many edge cases. Luckily, Shopify has incredibly smart individuals who do understand the intricacies of the U.S. tax system, and we were able to consult them (very often) to keep our team’s wheels from spinning.

3. Plan For the Future

This post discusses a U.S.-based feature on our platform. But what about other regions and countries with different tax laws? Fortunately, we considered this next step before we built the product, and so we were able to generalize much of our code and infrastructure in order to easily extend the functionality beyond the United States. This paid off, because afterwards, we launched the same feature for our merchants who are required to collect Canadian GST. And that time around, everything went much more smoothly. 

Siraj is on the Shopify Tax team where he enjoys using data science to solve the complex tax needs of our merchants. Prior to Shopify, he brought together his skills in software engineering and machine learning in the finance industry. He lives in the Washington, D.C. area with his wife and two cats.

