We recently released ShopifyQL Notebooks, a powerful self-serve tool which enables merchants to explore how their business metrics are changing, understand the reasons for the change, and share their findings as rich data stories. ShopifyQL Notebooks is powered by its very own query language ShopifyQL, and by data models built specifically for commerce. In addition to merchants, Shopify partners can use ShopifyQL and the commerce data models to build apps that address niche merchant needs.
In building commerce data models, our main focus was allowing business-savvy users to query data in an intuitive way without any SQL expertise. To make this vision a reality, our commerce data models encapsulate key metrics for a merchant’s business across different business processes. For example, a natural business question to pose is “What were the sales and view-to-purchase conversion metrics for product X last month?” Traditionally you would need a data analyst to answer this question, since it requires querying data assets owned by the sales and marketing departments and joining the two.
Building commerce data models with the intention of simplifying querying was challenging. We started with a comprehensive set of business questions a merchant may want answered. Next, we determined the columns our data model should contain to answer these questions. Finally, we had to determine the proper granularity for our data model—in other words, what does each row in our data model represent? A related challenge we faced while building our commerce data model was reconciling the difference in granularity of different datasets, for instance, between the sales and marketing data mentioned above. On top of all this, we had to ensure our data model returned results quickly even for very large merchants.
Determining What Properties We Wanted in the Data Model
We want the data models we give to merchants to be purpose-driven, explorable, and extensible. Let’s take a look at each of these in more detail.
Purpose-driven data models enable users to have a clear plan of action. In our case, we want our users to answer commerce-related questions across different domains such as products, orders, and customers. We want to be opinionated about the data we include in the model, with every column serving a clear goal for the user.
Explorable data models are built with the right granularity, useful dimensions, and easy-to-aggregate metrics. Finding the right granularity is all about striking the right balance. For instance, do we need each row of the data model to be at a per-second time grain? Are users going to need data to make decisions on a second-by-second basis? In our case, the answer is probably not. However, a monthly time grain may not be specific enough for our users. In commerce, we can be opinionated about the dimensions that users need to make good decisions. For example, we provided merchants with dimensions like “sales channel” and “product title” so they can drill down on the things that matter most to them.
Providing users with fully additive metrics across all dimensions can make exploring datasets easier. This means users don’t have to worry about whether an aggregate function conforms to all available dimensions. To understand the benefits of fully additive metrics, let’s look at the two other types: semi-additive and non-additive metrics. Take a look at the mock data below:
| Event At | Product Name | Customers who placed an order | Gross Sales | Discount % |
| --- | --- | --- | --- | --- |
| 2022-01-02 | Best T-shirt | 2 | $60 | 5% |
| 2022-01-02 | Cool Jeans | 4 | $200 | 10% |
| 2022-01-03 | Nice ties | 5 | $1345 | 20% |
| 2022-01-03 | Best T-shirt | 3 | $90 | 0% |
There are five unique customers in this dataset, and each customer placed orders for different products. For instance, customer 1 placed an order for a “Best T-shirt” and “Cool Jeans”. “Customers who placed an order” is a semi-additive metric: if we tried to sum the column to calculate daily total customers, we would double count customers like customer 1 who bought more than one product that day. “Gross Sales”, however, is fully additive because it can be summed across both the “Event At” dimension and the “Product Name” dimension. The “Discount %” metric is non-additive and can never be summed by any of the dimensions; if you did, you could end up with a value greater than 100%, which wouldn’t make sense.
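To make the distinction concrete, here’s a small pandas sketch over the mock data above. It’s only an illustration of the three metric types, not how ShopifyQL computes them:

```python
import pandas as pd

# Mock data from the table above.
df = pd.DataFrame({
    "event_at": ["2022-01-02", "2022-01-02", "2022-01-03", "2022-01-03"],
    "product_name": ["Best T-shirt", "Cool Jeans", "Nice ties", "Best T-shirt"],
    "customers_who_placed_an_order": [2, 4, 5, 3],
    "gross_sales": [60, 200, 1345, 90],
    "discount_pct": [0.05, 0.10, 0.20, 0.00],
})

# Fully additive: gross sales can be summed across any dimension.
daily_sales = df.groupby("event_at")["gross_sales"].sum()
product_sales = df.groupby("product_name")["gross_sales"].sum()

# Semi-additive: summing customer counts across products double counts
# customers who bought more than one product (14 here, even though the
# dataset only contains five unique customers).
naive_customer_total = df["customers_who_placed_an_order"].sum()

# Non-additive: a summed discount percentage is meaningless; an average
# (or a weighted average) is the kind of aggregate that makes sense.
avg_discount = df["discount_pct"].mean()
```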
We want our users to feel comfortable building on top of these models for their own purposes. Suppose we present our user with a sales data model. Perhaps the user is interested in creating a weekly customer retention rate model to track customer satisfaction. We should keep these types of abstractions in mind when building data models. Similarly, we should allow users to create their own metrics that may be different from the ones we offer in the model. For instance, suppose we want to provide our users with a net sales metric using the data below:
| Gross Sales | Returns | Shipping | Discounts | Net Sales |
| --- | --- | --- | --- | --- |
| $500.00 | $30.00 | $10.00 | $50.00 | $420.00 |
In our dataset, we calculate Net Sales as:
Net Sales = Gross Sales - Returns - Discounts
which for the row above works out to $500.00 - $30.00 - $50.00 = $420.00.
But suppose our users want to include shipping in their net sales calculation. They should be able to do so and aggregate by the same dimensions as the “Net Sales” metric.
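As a rough sketch of that kind of extensibility (in pandas rather than ShopifyQL, using the columns from the table above), a user-defined metric derived from the same fully additive columns can be aggregated along the same dimensions as the built-in one:

```python
import pandas as pd

# One row of the illustrative sales data from the table above.
sales = pd.DataFrame({
    "gross_sales": [500.00],
    "returns": [30.00],
    "shipping": [10.00],
    "discounts": [50.00],
})

# The metric we provide out of the box.
sales["net_sales"] = sales["gross_sales"] - sales["returns"] - sales["discounts"]

# A user-defined variant that also accounts for shipping (here, adding it).
# Because it's built from fully additive columns, it can be summed across
# the same dimensions as the built-in Net Sales metric.
sales["net_sales_incl_shipping"] = sales["net_sales"] + sales["shipping"]
```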
Some examples of commerce-focused data models are the products, orders, and benchmarks datasets. We built these data models keeping in mind that our users have varying degrees of technical knowledge, so we wanted to make sure they are easy to understand.
The Data Modeling Process
In the previous section, we outlined what properties we wanted in a data model. In this section, we'll tell you how to achieve them. You can approach data modeling in many different ways; this is the approach that worked for us.
1. Start With a List of Business Questions
The first step is to understand what metrics and dimensions are most important to you or your users (for us, that’s our merchants). This empowers users to tell great data stories and encourages them to outline clear plans for action. In our case, that might mean a merchant creating marketing campaigns for their best-converting products.
We approached this problem by having several conversations with Product Managers, UX Researchers, and any other domain experts we could get our hands on. You want these questions to lead to potential actions that you or your users can take. Here are a couple questions we came up with for our merchants:
- What is the conversion rate of my top selling products? Even a small improvement in the conversion rate of top selling products can have a large impact on revenue.
- Which of my products are not selling well? List the products that were viewed frequently, but are not top selling.
The goal is to write down as many business questions as possible and then break them down to create a list of data columns to include into the data model.
2. Create Mock Data
Creating mock data helps confirm that our data model is at the correct grain and contains the correct columns. There are a couple ways to do this, but I usually just open up Google Sheets and create my data manually. I like to create enough mock data that I have a few unique values for each of my columns. After that, you can easily load it into the data platform of your choice such as BigQuery or Trino. For the products model, our data looked like this:
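If you’d rather script the mock data than build it in a spreadsheet, here’s a minimal sketch. The column names are illustrative placeholders rather than the exact schema of the products model, and the BigQuery table ID is made up:

```python
import pandas as pd

# A handful of rows is enough -- just make sure every column has a few
# distinct values so your test queries exercise the grain and dimensions.
mock_products = pd.DataFrame({
    "day": ["2022-01-02", "2022-01-02", "2022-01-03", "2022-01-03"],
    "product_title": ["Best T-shirt", "Cool Jeans", "Best T-shirt", "Nice ties"],
    "view_sessions": [120, 80, 150, 40],
    "purchase_sessions": [12, 4, 18, 2],
    "gross_sales": [60.0, 200.0, 90.0, 1345.0],
})

mock_products.to_csv("mock_products.csv", index=False)

# Load it into the warehouse of your choice, for example BigQuery
# (requires the google-cloud-bigquery package and credentials):
# from google.cloud import bigquery
# client = bigquery.Client()
# client.load_table_from_dataframe(
#     mock_products, "my-project.scratch.mock_products"
# ).result()
```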
Now that the data is queryable, you can start writing queries. For each business question, we should have a corresponding query that can answer the question. In ShopifyQL Notebooks, we refer to these as “templates”. Templates let you know how you can take advantage of the data that is available.
Here’s an example of a query and its results (using mock data) for the business question “What are my top converting products?”:
Since ShopifyQL has its own syntax and aggregate functions, writing queries on mock data ensures the language can answer the business questions given the right data.
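The actual template is written in ShopifyQL, but the same sanity check works in any engine. Continuing with the hypothetical mock_products DataFrame from the previous sketch, a pandas equivalent of “What are my top converting products?” might look like this:

```python
# "What are my top converting products?" against the mock data above.
conversion = (
    mock_products.groupby("product_title")[["view_sessions", "purchase_sessions"]]
    .sum()
    .assign(conversion_rate=lambda d: d["purchase_sessions"] / d["view_sessions"])
    .sort_values("conversion_rate", ascending=False)
)
print(conversion)
```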
3. Find the Data
Now, you can make a list of all the tables that you need to build your data model. If the data doesn’t exist, put together a plan for creating it. In the figure below, you can see a basic representation of all the upstream data we required for creating the products model.
4. Assess Data Quality and Consistency
Next, we assess data quality and consistency. To avoid surfacing incorrect data, take some time to analyze the distributions of the different columns in your data model. Make note of outliers, missing data, and invalid data.
Check out our post on conducting exploratory data analysis for more great ideas on how you can assess data quality.
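For a sense of what that profiling can look like in practice, here’s a minimal sketch; the file path and column names are hypothetical:

```python
import pandas as pd

# A sample of the candidate data model pulled into a DataFrame for profiling.
products = pd.read_csv("products_sample.csv")

print(products.describe(include="all"))   # ranges and distributions per column
print(products.isna().sum())              # missing data per column
print(products.duplicated().sum())        # unexpected duplicate rows

# Domain-specific sanity checks for invalid data and outliers.
assert (products["gross_sales"] >= 0).all(), "negative gross sales?"
assert (products["purchase_sessions"] <= products["view_sessions"]).all(), \
    "more purchases than views for some rows?"
```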
Data consistency considerations arise when data appears in many places across an organization. For example, at Shopify we surface information about products in many parts of the Shopify admin, which is where merchants land once they log into Shopify to set up their stores, configure settings, and manage their business. If a merchant sees a value for total views for a given product in one place in the Shopify admin, it should match every other place. If it doesn’t, we need to communicate why the user is seeing a discrepancy.
5. Assess Model Freshness
The data should be fresh and performant for all queries to your data model. Depending on your upstream datasets, freshness can be a real problem and degrade the value proposition of your data model for users. For instance, an upstream data model that is computationally intensive and has a long build time will hold back the freshness of your model. Your model’s freshness is only as good as your least fresh upstream model.
We retrieved the SLOs of each of our input datasets and performed a max of the min calculation to determine the overall freshness of our products model. Speak to stakeholders and make sure your model freshness is acceptable and does not degrade your user experience.
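As a toy version of that calculation (the upstream names and SLO numbers are made up), the model can never be fresher than its least fresh input:

```python
# Hypothetical freshness SLOs for upstream inputs, in hours of allowed lag.
upstream_slo_hours = {
    "order_lines": 2,
    "product_page_sessions": 6,
    "checkout_events": 4,
}

# The products model can only promise the weakest upstream guarantee,
# plus whatever time its own build takes.
model_build_hours = 1
overall_freshness_hours = max(upstream_slo_hours.values()) + model_build_hours
print(f"Products model freshness SLO: ~{overall_freshness_hours}h")
```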
6. Assess Model Performance
Model performance can be broken down into two categories: build time and query performance. Model build time refers to how long it takes the data pipeline to stitch together input datasets into the final model. There are many different ways to optimize your build time.
One way we improved the products model build time was by changing the data pipeline from a full-drop data model into an incremental one. In a full-drop build, the entire data model is recomputed and completely replaces the existing one. An incremental build only loads the difference between the data already in the model and the new data that is available.
Incremental models can vastly improve your data model build time, and in our case we saw a huge improvement. They can be harder to maintain than full-drop models, but for us the benefits far outweighed the cost. The main benefit in our use case was performance: incremental models process less data at a time, leading to faster builds than full-drop models. The added difficulty comes from the fact that new data can differ from existing data; for example, incoming data can include additional columns, which would require a full historical rebuild of the data model. Incremental models also usually require additional dependencies in your data pipeline, resulting in more time spent on monitoring by your team.
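To make the difference concrete, here’s a simplified sketch of the two strategies in pandas. Real pipelines would do this with partitioned warehouse tables, and the function and column names here are placeholders:

```python
import pandas as pd

def transform(rows: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the joins/aggregations that produce products model rows.
    return rows

def full_drop_build(all_source_rows: pd.DataFrame) -> pd.DataFrame:
    # Recompute the whole model from scratch and replace the existing table.
    return transform(all_source_rows)

def incremental_build(existing_model: pd.DataFrame,
                      new_source_rows: pd.DataFrame) -> pd.DataFrame:
    # Only transform rows newer than what the model already contains,
    # then append them to the existing model.
    last_built_day = existing_model["day"].max()
    delta = new_source_rows[new_source_rows["day"] > last_built_day]
    return pd.concat([existing_model, transform(delta)], ignore_index=True)
```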
Query performance is about how fast your query engine (for us, that’s ShopifyQL) completes queries written against your data model. High cardinality columns can be expensive to GROUP BY, and depending on the grain of your dataset, a query may need to scan many rows to generate the desired result: the larger the time period selected, the more rows have to be scanned, which can mean slow queries and an undesirable user experience. Roll-ups help here. Once data is a bit old, it’s unlikely to change (for example, we’re unlikely to see corrections to the sales we recorded a month ago), and we know what grain of data we want to fetch. For instance, if we know we want daily sales by product title, the finest grain we need is daily, so we can “roll” all the rows for a single day per product “up” into a single row and store that in a separate table. Scanning this table is cheaper and faster than scanning the raw data since there are far fewer rows. This process can greatly improve query times and lead to a much better user experience.
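Here’s that roll-up in miniature, in pandas with illustrative column names: event-grain rows are pre-aggregated into one row per day per product title, and queries scan the much smaller table instead.

```python
import pandas as pd

# A tiny stand-in for the raw, event-grain table: one row per storefront event.
raw_events = pd.DataFrame({
    "event_at": pd.to_datetime(
        ["2022-01-02 09:15", "2022-01-02 09:40", "2022-01-03 14:05"]),
    "product_title": ["Best T-shirt", "Best T-shirt", "Cool Jeans"],
    "session_id": ["a", "b", "c"],
    "sale_amount": [30.0, 30.0, 50.0],
})

# Roll the raw rows up to the finest grain we actually serve -- one row per
# day per product -- and store that in a separate, much smaller table.
daily_rollup = (
    raw_events
    .assign(day=raw_events["event_at"].dt.date)
    .groupby(["day", "product_title"], as_index=False)
    .agg(view_sessions=("session_id", "nunique"),
         gross_sales=("sale_amount", "sum"))
)
```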
The Products Model in Action
In this section, we are going to show you how Shopify’s products data model can be queried using ShopifyQL notebooks. The products data model is built to inform merchants about how their products are selling and converting.
Let’s take a look at how our products are converting.
I can now answer questions like “What are my most popular products by traffic?”
I can also write queries to identify which products I need to improve, such as products abandoned at checkout.
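In the notebook these are ShopifyQL queries against the products model; as a rough pandas sketch of the abandoned-at-checkout idea, using hypothetical column names rather than the real schema:

```python
import pandas as pd

# Hypothetical slice of the products model: sessions by funnel step per product.
products = pd.DataFrame({
    "product_title": ["Best T-shirt", "Cool Jeans", "Nice ties"],
    "view_sessions": [150, 80, 40],
    "checkout_sessions": [30, 25, 5],
    "purchase_sessions": [18, 4, 2],
})

# Products most often abandoned at checkout: started checkouts minus purchases.
abandoned = (
    products
    .assign(abandoned_checkouts=lambda d: d["checkout_sessions"] - d["purchase_sessions"])
    .sort_values("abandoned_checkouts", ascending=False)
)
print(abandoned)
```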
With the products model, merchants can understand how customers are engaging with their product pages and progressing through checkout. Understanding the customer journey and identifying bottlenecks in converting visitors into paying customers is essential to improving revenue.
In Review
That wraps up our walkthrough. Let’s review the main takeaways:
- Data models should be purpose-driven, explorable, and extensible.
- Start with a list of business questions. Speak to stakeholders and domain experts to better understand the power your data model is meant to unlock.
- Create mock data and query it. To ensure ShopifyQL can answer your business questions, you need to write real queries against fake data.
- Discover available data. Identify the upstream data models that you require to build your data model.
- Assess data model quality, freshness, and performance. These are essential in providing a delightful user experience.
Cody Mazza-Anthony is a Senior Data Scientist on the Insights team. Cody enjoys building intelligent systems to enable merchants to grow their business. If you’d like to connect with Cody, reach out on LinkedIn.
Visit our Data Science and Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together—a future that is digital by design.