5 Steps for Building Machine Learning Models for Business

By Ali Wytsma and C. Carquex

Over the last decade, machine learning has undergone a broad democratization. Countless tutorials, books, lectures, and blog articles have been published on the topic. But while the technical aspects of how to build and optimize models are well documented, very few resources cover how developing machine learning models fits within a business context. When is it a good idea to use machine learning? How do you get started? How do you update a model over time without breaking the product?

Below, we’ll share five steps and supporting tips on approaching machine learning from a business perspective. We’ve used these steps and tips at Shopify to help build and scale our suite of machine learning products. They may look simple, but used together they give you a straightforward workflow to productionize models that actually drive impact.

Flow diagram: a guide for building machine learning models, showing the five steps discussed in this article.

1. Ask Yourself If It’s the Right Time for Machine Learning?

Before starting the development of any machine learning model, the first question to ask is: should I invest resources in a machine learning model at this time? It’s tempting to spend lots of time on a flashy machine learning algorithm. This is especially true if the model is intended to power a product that is supposed to be “smart”. Below are two simple questions to assess whether it’s the right time to develop a machine learning model:

a. Will This Model Be Powering a Brand New Product?

Launching a new product requires a tremendous amount of effort, often with limited resources. Shipping a first version, understanding product fit, figuring out user engagement, and collecting feedback are critical activities to be performed. Choosing to delay machine learning in these early stages allows resources to be freed up and focused instead on getting the product off the ground.

That said, it’s worth planning early for how to set up the data flywheel and how machine learning could improve the product down the line. Data is what makes or breaks any machine learning model, and a solid strategy for data collection will serve the team and product for years to come. We recommend exploring what will be beneficial later so that the right foundations are in place from the beginning, but holding off on using machine learning until a later stage.

Conversely, if the product is already launched and proven to solve the user’s pain points, developing a machine learning model might improve and extend it.

b. How Are Non-machine Learning Methods Performing?

Before jumping ahead with developing a machine learning model, we recommend trying to solve the problem with a simple heuristic. The performance of these methods is often surprisingly good. A benefit of starting with this class of solution is that it’s typically easier and faster to implement, and it provides a good baseline to measure against if you decide to build a more complex solution later on. It also lets the practitioner get familiar with the data and develop a deeper understanding of the problem they’re trying to solve.

In 90 percent of cases, you can create a baseline using heuristics. Here are some of our favorites for various types of business problems (a sketch of the forecasting baseline follows the list):

  • Forecasting: For forecasting with time series data, moving averages are often robust and efficient.
  • Predicting Churn: A behavioural cohort analysis to determine user dropoff points is hard to beat.
  • Scoring: For scoring business entities (for example, leads and customers), a composite index based on two or three weighted proxy metrics is easy to explain and fast to spin up.
  • Recommendation Engines: Recommending content that’s popular across the platform, with some randomness to increase exploration and content diversity, is a good place to start.
  • Search: Stemming and keyword matching give a solid heuristic.
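
To make the forecasting heuristic concrete, here’s a minimal sketch of a moving-average baseline. The data, column names, and seven-day window are illustrative assumptions, not values from an actual Shopify system.

```python
import pandas as pd

# Hypothetical daily sales series; replace with your own data.
sales = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=60, freq="D"),
    "units_sold": range(60),
}).set_index("date")

# Use a 7-day trailing moving average as the forecast for the next day.
sales["moving_avg"] = sales["units_sold"].rolling(window=7).mean()
forecast = sales["moving_avg"].shift(1)  # yesterday's average predicts today

# Baseline error to beat before investing in anything fancier.
mae = (sales["units_sold"] - forecast).abs().mean()
print(f"Moving-average baseline MAE: {mae:.2f}")
```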

2. Keep It Simple

When developing a first model, the excitement of seeking the best possible solution often leads to adding unnecessary complexity early on: engineering extra features or choosing the latest popular model architecture can certainly provide an edge. However, they also increase the time to build, the overall complexity of the system, as well as the time it takes for a new team member to onboard, understand, and maintain the solution.

On the other hand, simple models enable the team to rapidly build out the entire pipeline and de-risk any surprise that could appear there. They’re the quickest path to getting the system working end-to-end.

At least for the first iteration of the model, we recommend being mindful of these costs by starting with the simplest approach possible. Complexity can always be added later on if necessary. Below are a few tips that help cut down complexity:

Start With Simple Models

Simple models speed up iteration and are easier to understand. When possible, start with robust, interpretable models that train quickly (a shallow decision tree, linear regression, or logistic regression are three good initial choices). These models are especially valuable for getting buy-in from stakeholders and non-technical partners because they’re easy to explain. If a simple model is adequate, great! Otherwise, you can move to something more complex later on. For instance, when training a model for scoring leads for our Sales Representatives, we noticed that the performance of a random forest model and a more complex ensemble model were on par. We ended up keeping the random forest since it was robust, fast to train, and simple to explain.
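
As a sketch of what starting simple can look like, the snippet below trains a logistic regression and a shallow decision tree on a generic scikit-learn dataset and compares their test accuracy. The dataset and model settings are stand-ins for illustration, not the lead-scoring setup described above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset; in practice this would be your own labelled business data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "shallow decision tree": DecisionTreeClassifier(max_depth=3, random_state=0),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy = {model.score(X_test, y_test):.3f}")

# A shallow tree or linear model is easy to explain to stakeholders:
# inspect the coefficients or plot the tree before reaching for anything heavier.
```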

Start With a Basic Set of Features

A basic set of features allows you to get up and running fast, and you can defer most feature engineering work until it’s needed. A reduced feature space also means that computational tasks run faster, which speeds up iteration. Domain experts often provide valuable suggestions for where to start. For example, at Shopify, when building a system to predict the industry of a given shop, we noticed that the weight of the products sold was correlated with the industry: furniture stores tend to sell heavier products (mattresses and couches) than apparel stores (shirts and dresses). Starting with these basic features that we knew were correlated with the target allowed us to get an initial read on performance without going deep into building a feature set.
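
As a rough illustration of a deliberately small starting feature set, here’s a sketch that derives a few per-shop aggregates. The table and column names (shop_id, weight_kg, price) are hypothetical, not Shopify’s actual schema.

```python
import pandas as pd

# Hypothetical product catalogue; columns are illustrative assumptions.
products = pd.DataFrame({
    "shop_id": [1, 1, 2, 2, 2],
    "weight_kg": [30.0, 45.0, 0.3, 0.4, 0.2],
    "price": [900, 1200, 25, 35, 20],
})

# A deliberately small starting feature set: per-shop aggregates only.
features = products.groupby("shop_id").agg(
    avg_weight_kg=("weight_kg", "mean"),
    avg_price=("price", "mean"),
    product_count=("weight_kg", "size"),
)
print(features)
```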

Leverage Off-the-shelf Solutions

For some tasks (in particular those related to images, video, audio, or text), deep learning is essential to get good results. In these cases, pre-trained, off-the-shelf models help build a powerful solution quickly and easily. For instance, for text processing, a pre-trained word embedding model feeding into a logistic regression classifier might be sufficient for an initial release. Fine-tuning the embedding on the target corpus can come in a subsequent iteration, if there’s a need for it.
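
Here’s a minimal sketch of that pattern, assuming the sentence-transformers package and a generic pre-trained embedding model. The model name, toy texts, and sentiment labels are illustrative choices, not the setup used at Shopify.

```python
from sentence_transformers import SentenceTransformer  # assumes this package is installed
from sklearn.linear_model import LogisticRegression

# Toy labelled text; in practice this would be your product's text data.
texts = ["great product, fast shipping", "item arrived broken", "love it", "terrible quality"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative (illustrative task)

# Pre-trained, off-the-shelf embedding model (one common choice, not prescriptive).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
X = embedder.encode(texts)

# A simple classifier on top of frozen embeddings is often enough for a first release.
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(embedder.encode(["arrived damaged and late"])))
```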

3. Measure Before Optimizing

A common pitfall we’ve encountered is starting to optimize machine learning models too early. While it’s true that thousands of parameters and hyperparameters need tuning (the model architecture, the choice of objective function, the input features, and so on), jumping to that stage too fast is counterproductive. Answering the two questions below before diving in helps make sure your system is set up for success.

a. How is the Incremental Impact of the Model Going to Be Measured?

Benchmarks are critical to the development of machine learning models. They allow for the comparison of performance. There are two steps to creating a benchmark, and the second one is often forgotten.

Select a Performance Metric

The metric should align with the primary objectives of the business. One of the best ways to do this is to build an understanding of what the value actually means. For instance, what does an accuracy of 98 percent mean in the business context? For a fraud detection system, accuracy would be a poor metric choice, and 98 percent could hide terrible performance, since instances of fraud are typically rare. In another situation, where accuracy is a reasonable metric, 98 percent could mean great performance.

For comparison purposes, a baseline value for the performance metric can be provided by an initial non-machine learning method, as discussed in the Ask Yourself If It’s the Right Time for Machine Learning? section.

Tie the Performance Metric Back to the Impact on the Business

Design a strategy to measure the impact of a performance improvement on the business. For instance, if the metric chosen in step one is accuracy, the strategy chosen in step two should let you quantify how each percentage point of improvement affects the user of the product. Is an increase from 0.8 to 0.85 a game changer in the industry, or barely noticeable to the user? Are those 0.05 extra points worth the potential added time and complexity? Understanding this tradeoff is key to deciding how to optimize the model, and it drives decisions such as whether to keep investing time and resources in a given model or to stop.

b. Can You Explain the Tradeoffs That the Model Is Making?

When a model appears to perform well, it’s easy to celebrate too soon and become comfortable with the idea that machine learning is an opaque box that delivers magical performance. In our experience, in about 95 percent of cases that magical performance is actually a symptom of an issue in the system. A poor choice of performance metric, data leakage, or an uncaught class balancing issue are just a few examples of what could be going wrong.

Being able to understand the tradeoffs behind the performance of the model will allow you to catch any issues early, and avoid wasting time and compute cycles on optimizing a faulty system. One way to do this is by investigating the output of the model, and not just its performance metrics:

  • Classification System: What does the confusion matrix look like? Does the balancing of classes make sense?
  • Regression Model: What does the distribution of residuals look like? Is there any apparent bias?
  • Scoring System: What does the distribution of scores look like? Are the scores all grouped toward one end of the scale?

 

Example: Order Dataset (Prediction Accuracy: 98%)

|                                     | Actual: order is fraudulent | Actual: order is not fraudulent |
|-------------------------------------|-----------------------------|---------------------------------|
| Predicted: order is fraudulent      | 0                           | 0                               |
| Predicted: order is not fraudulent  | 20                          | 1,000                           |

Example of a model output with an accuracy of 98%. While 98% may look like a win, there are two issues at play:
  1. The model consistently predicts “Order is not fraudulent”.
  2. Accuracy isn’t the appropriate metric for measuring the performance of the model.

Optimizing the model in this state doesn’t make sense; the metric needs to be fixed first.
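
A quick sketch of how this check might look in code, reconstructing the counts from the table above with scikit-learn’s metrics (the 0/1 encoding is an assumption for illustration):

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Reconstruct the example: 20 fraudulent orders, 1,000 legitimate ones,
# and a model that always predicts "not fraudulent" (1 = fraud, 0 = legit).
y_true = np.array([1] * 20 + [0] * 1000)
y_pred = np.zeros_like(y_true)

print(confusion_matrix(y_true, y_pred))               # every fraud case is missed
print("accuracy:", accuracy_score(y_true, y_pred))    # ~0.98, looks like a win
print("fraud recall:", recall_score(y_true, y_pred))  # 0.0, the real story
```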

Optimizing the various parameters becomes simpler once the performance metric is set and tied to business impact: the optimization stops when it no longer drives incremental business impact. Similarly, being able to explain the tradeoffs behind a model means that errors otherwise masked by apparently great performance are likely to get caught early.

4. Have a Plan to Iterate Over Models

Machine learning models evolve over time: they can be retrained at a set frequency, their architecture can be updated to increase their predictive power, and features can be added or removed as the business evolves. When updating a machine learning model, the rollout of the new model is usually a critical part. We must understand its performance relative to our baseline, and there should be no regression in performance. Here are a few tips that have helped us do this effectively:

Set Up the Pipeline Infrastructure to Compare Models

Models are built and rolled out iteratively. We recommend investing in building a pipeline to train and experimentally evaluate two or more versions of the model concurrently. Depending on the situation, there are several ways to evaluate a new model (a sketch of an offline comparison follows the list). Two great methods are:

  • If it’s possible to run an experiment without surfacing the output in production (for example, for a classifier where you have access to the labels), having a staging flow is sufficient. For instance, we did this in the case of the shop industry classifier, mentioned in the Keep It Simple section. A major update to the model ran in a staging flow for a few weeks before we felt confident enough to promote it to production. When possible, running an offline experiment is preferable because if there are performance degradations, they won’t impact users.
  • An online A/B test works well in most cases. By exposing a random group of users to the new version of the model, we get a clear view of its impact relative to our baseline. As an example, for a recommendation system where our key metric is user engagement, we assess how engaged the users exposed to the new model version are compared to users seeing the baseline recommendations, to know if there’s a significant improvement.
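
Here’s a minimal sketch of the offline comparison idea, with a generic dataset and two scikit-learn models standing in for the production model and the candidate; in a real staging flow both versions would score the same recent labelled traffic.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; both versions are evaluated on the same holdout.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_holdout, y_train, y_holdout = train_test_split(X, y, random_state=42)

production_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
candidate_model = RandomForestClassifier(n_estimators=200, random_state=42)

for name, model in [("production", production_model), ("candidate", candidate_model)]:
    model.fit(X_train, y_train)
    print(f"{name}: holdout F1 = {f1_score(y_holdout, model.predict(X_holdout)):.3f}")

# Promote the candidate only if it beats the baseline by a margin that matters
# for the business metric, not just the offline score.
```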

Make Sure Comparisons Are Fair

Will the changes affect how the metrics are reported? As an example, in a classification problem, if the class balance differs between the set the model variant is evaluated on and production, the comparison may not be fair. Similarly, if we’re changing the dataset being used, we may not be able to use the same population to evaluate our production model and our variant model. If there’s bias, we try to change how the evaluations are conducted to remove it. In some cases, it may be necessary to adjust or reweight metrics to make the comparison fair.

Consider Possible Variance in Performance Metrics

One run of the variant model may not be enough to understand its impact. Model performance can vary due to many factors, like random parameter initialization or how data is split between training and testing. Verify the model’s performance over time, between runs, and across minor differences in hyperparameters. If the performance is inconsistent, this could be a sign of bigger issues (we’ll discuss those in the next section!). Also, verify whether performance is consistent across key segments of the population. If that’s a concern, it may be worth reweighting the metric to prioritize key segments.
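
One way to see that variance is simply to repeat training with different seeds and look at the spread; the sketch below uses a synthetic scikit-learn dataset as a stand-in.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; replace with your own.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Repeat the split and training with different seeds to measure the noise floor.
scores = []
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))

print(f"accuracy: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
# If the gap between two model versions is smaller than this spread,
# the comparison isn't telling you much.
```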

Does the Comparison Framework Introduce Bias?

It’s important to be aware of the risk of overfitting when optimizing, and to account for it when developing a comparison strategy. For example, always evaluating against a fixed test data set can cause you to optimize your model for those specific examples. Incorporating practices into your comparison strategy like cross validation, rotating the test data set, using a holdout set, regularization, and running multiple tests whenever random initializations are involved helps to mitigate these problems.
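
For instance, k-fold cross validation rotates which data is held out for evaluation, so you aren’t repeatedly tuning against one fixed test set. A minimal sketch with stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)  # stand-in data

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scores = cross_val_score(model, X, y, cv=5, scoring="f1")  # 5 different test folds
print(scores.round(3), f"mean={scores.mean():.3f}")
```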

5. Consider the Stability of the Model Over Time

One aspect that’s rarely discussed is the stability of predictions as a model evolves over time. Say the model is retrained every quarter and the performance metric is steadily increasing: performance is improving overall. However, individual subjects may still have their predictions change from one version to the next, even as overall performance improves. That may cause a subset of users to have a negative experience with the product without the team anticipating it.

As an example, consider a case where a model is used to decide whether a user is eligible for funding, and that eligibility is exposed to the user. If the user sees their status fluctuate, that could create frustration and destroy trust in the product. In this case, we may prefer stability over marginal performance improvements. We may even choose to incorporate stability into our model performance metric.

Example of the decision boundary of a model at two different points in time (the Q1 model on the left, the Q2 model on the right). The symbols represent the actual data points and the class they belong to (red division sign or blue multiplication sign); the shaded areas represent the class predicted by the model. Overall the accuracy increased, but two samples out of the eight switched to a different class. This illustrates the case where the eligibility status of a user fluctuates over time.

Being aware of this effect and measuring it is the first line of defense. The causes vary depending on the context; the issue can be tied to a form of overfitting, though not always. Here’s our checklist to help prevent it (a sketch of one way to measure prediction churn follows the checklist):

  • Understand the costs of changing your model. Consider the tradeoff between the improved performance and the impact of changed predictions, along with the work needed to manage that impact. Avoid major changes to the model unless the performance improvements justify the costs.
  • Prefer shallow models to deep models. For instance, in a classification problem, a change in the training dataset is more likely to make a deep model shift its decision boundary in local spots than a shallow model. Use deep models only when the performance gains justify them.
  • Calibrate the output of the model, especially for classification and ranking systems. Calibration highlights changes in the output distribution and reduces them.
  • Check the conditioning of the objective function and the regularization. A poorly conditioned model has a decision boundary that changes wildly even when the training conditions change only slightly.
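
As one way to quantify this, here’s a sketch that measures prediction churn: the share of subjects whose predicted class flips between two retrained versions of the same model. The dataset and the “quarterly retrain” split are stand-ins for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for two quarterly retrains: same model class, slightly more data in Q2.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_q1, X_extra, y_q1, y_extra = train_test_split(X, y, test_size=0.2, random_state=0)

model_q1 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X_q1, y_q1)
model_q2 = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(
    np.vstack([X_q1, X_extra]), np.concatenate([y_q1, y_extra])
)

# Prediction churn: share of subjects whose predicted class flips between versions.
churn = (model_q1.predict(X) != model_q2.predict(X)).mean()
print(f"prediction churn between versions: {churn:.1%}")
# Track this alongside the accuracy metric; a small accuracy gain may not
# justify a large amount of churn for users.
```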

The Five Factors That Can Make or Break Your Machine Learning Project

To recap, there are a lot of factors to consider when building products and tools that leverage machine learning in a business setting, and considering them carefully can make or break the success of your machine learning projects. Always remember to:

  1. Ask yourself if it’s the right time for machine learning. When releasing a new product, it’s best to start with a simple baseline solution and improve on it down the line with machine learning.
  2. Keep it simple! Simple models and feature sets are typically faster to iterate on and easier to understand, both of which are crucial for the first version of a machine learning product.
  3. Measure before optimizing. Make sure that you understand the ins and outs of your performance metric and how it impacts the business objectives. Have a good understanding of the tradeoffs your model is making.
  4. Have a plan to iterate over models. Expect to iteratively make improvements to the model, and make a plan for how to make good comparisons between new model versions and the existing one.
  5. Consider the stability of the model over time. Understand the impact stability has on your users, and take that into consideration as you iterate on your solution. 

Ali Wytsma is a data scientist leading Shopify's Workflows Data team. She loves using data in creative ways to help make Shopify's admin as powerful and easy to use as possible for entrepreneurs. She lives in Kitchener, Ontario, and spends her time outside of work playing with her dog Kiwi and volunteering with programs to teach kids about technology.

Carquex is a Senior Data Scientist on Shopify’s Global Revenue Data team. Check out his previous blog post, 4 Tips for Shipping Data Products Fast.


We hope this guide helps you in building robust machine learning models for whatever business needs you have! If you’re interested in building impactful data products at Shopify, check out our careers page.