Using Propensity Score Matching to Uncover Shopify Capital’s Effect on Business Growth

By Breno Freitas and Nevena Francetic

Five years ago, we introduced Shopify Capital, our data-powered product that enables merchants to access funding from right within the Shopify platform. We built it using a version of a recurrent neural network (RNN)—analyzing more than 70 million data points across the Shopify platform to understand trends in merchants’ growth potential and offer cash advances that make sense for their businesses. To date, we’ve provided our merchants with over $2.7 billion in funding.

But how much of an impact was Shopify Capital having, really? Our executives wanted to know—and as a Data team, we were invested in this question too. We were interested in validating our hypothesis that our product was having a measurable, positive impact on our merchants.

We’ve already delved into the impact of the program in another blog post, Digging Through the Data: Shopify Capital's Effect on Business Growth, but today, we want to share how we got our results. In this post, we’re going behind the scenes to show you how we investigated whether Shopify Capital does what we intended it to do: help our merchants grow.

The Research Question

What’s the impact on a shop’s future cumulative gross merchandise value (that is, sales) after it takes Shopify Capital for the first time?

To test whether Shopify merchants who accepted Capital were more successful than those who didn’t, we needed to compare their results against an alternative future (the counterfactual) in which merchants who desired Capital didn’t receive it. In other words, we needed an A/B test.

Unfortunately, to conduct a proper A/B test, we would need to randomly and automatically reject half of the merchants who expressed interest in Capital for some period of time so we could collect data for analysis. While this makes for good data collection, it would be a terrible experience for our users and would undermine our mission to help merchants grow. We were unwilling to do that.

Because Shopify Capital was only active in the US in 2019, an alternative was to use Canadian merchants who didn’t yet have access to Shopify Capital (Capital launched in Canada and the UK in Spring 2020) as our “alternate reality.” We needed to seek out Canadian shops that would have used Shopify Capital if given the opportunity, but weren’t able to because it wasn’t yet available in their market.

We can do this comparison through a method called “propensity score matching” (PSM).

Matchmaker, Matchmaker, Make Me a Match

In the 1980s, researchers Rosenbaum and Rubin proposed PSM as a method to reduce bias when estimating treatment effects from observational data sets. The method has become increasingly popular in medical research and social studies, particularly in cases where it isn’t possible to run a proper randomized trial. A propensity score is defined as the probability of a unit being assigned to the treatment group, given its observed characteristics. In this case: What are the chances of a merchant accepting Shopify Capital if it were offered to them?

It works like this: after propensity scores are estimated, participants in one set are matched with counterparts in the other set that have similar scores, as depicted below.

Depiction of matching performed on two sets of samples based on their propensity scores.

The propensity score gives us a measure of how similar two samples are in their likelihood of taking the treatment. We only analyze samples from both sets that are close enough to get a match while respecting any other constraints imposed by the selected matching methodology. This means we may even drop samples from the treatment set if their scores fall outside the parameters we’ve set.

Once matched, we’ll be able to determine the difference in gross merchandise value (GMV), that is, sales, between the control and treatment groups in the six months after the treated shops take Shopify Capital for the first time.

Digging into the Data Sets

As previously discussed, to do the matching we needed two sets of participants in the experiment: a treatment group and a control group. We decided to run our experiment over a six-month period starting in January 2019, which avoids any confounding effect of COVID-19.

We segment our two groups as follows:

  • Treatment Group: American shops that were first-time Capital adopters in January 2019, on the platform for at least three months prior (to ensure they were established on the platform), and still Shopify customers in April 2020.
  • Control Group: Canadian shops that had been customers for at least three months prior to January 2019 and pre-qualified for Capital in Canada when we launched it in April 2020.

Ideally, we would have recreated underwriting criteria from January 2019 to see which Canadian shops would have pre-qualified for Capital at that time. To proxy for this, we looked at shops that remained stable until at least April 2020 in the US and Canada, and then went backwards to analyze their 2019 data.

Key assumptions:

  • Shops in Canada didn’t take an offer for the sole reason that Capital didn’t exist in Canada at that time.
  • Shops in the US and Canada have equal access to external financing sources we can’t control (for example, small business loans).
  • The environments that Canadian and US merchants operate in are more or less the same.

Matchmaking Methodology

    We began our matching process with approximately 8,000 control shops and about 600 treated shops. Our goal was to make the distributions of the propensity scores for the two groups of shops match as closely as possible.

    Foundational Setup

    For the next stage in our matching, we set up some features, using characteristics from within the Shopify platform to describe a shop. The literature doesn’t prescribe a single right way to pick characteristics. Use your discernment to choose the ones that make the most sense for your business problem.

    We opted to use shops’ sales and performance on Shopify. While we have to keep the exact characteristics a secret for privacy reasons, we can say that some of the characteristics we used are the same ones the model would use to generate a Shopify Capital offer.

    At this stage, we also logarithmically transformed many of the covariates. We did this because some of the features we were using have wild extremes in variance (for example, large disparities in revenue). Transforming them to logarithmic space shrinks the variances, reduces skew, and makes the linear regressions behave better.
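
To make the setup concrete, here is a minimal sketch of a log transform followed by a propensity score model. Everything in it is an assumption for illustration: the revenue data is simulated, the single covariate stands in for the undisclosed features, and a hand-rolled logistic regression stands in for whatever model was actually used.

```python
import math
import random
import statistics

random.seed(0)

# Hypothetical, simulated covariate: revenue with a heavy right tail.
revenue = [math.exp(random.gauss(8.0, 1.5)) for _ in range(500)]
log_revenue = [math.log1p(r) for r in revenue]


def skewness(xs):
    """Third standardized moment; values near 0 mean a symmetric distribution."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return statistics.fmean([((x - mean) / sd) ** 3 for x in xs])


raw_skew = skewness(revenue)
log_skew = skewness(log_revenue)

# Standardize the transformed covariate so gradient ascent converges quickly.
mu = statistics.fmean(log_revenue)
sd = statistics.pstdev(log_revenue)
z = [(x - mu) / sd for x in log_revenue]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


# Simulated treatment assignment (demo assumption): shops with higher
# revenue are more likely to take the treatment.
treated = [1 if random.random() < sigmoid(-1.0 + zi) else 0 for zi in z]


def fit_logistic(x, t, lr=0.5, epochs=500):
    """Fit P(T=1 | x) = sigmoid(b0 + b1 * x) by gradient ascent on the mean log-likelihood."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(epochs):
        g0 = g1 = 0.0
        for xi, ti in zip(x, t):
            err = ti - sigmoid(b0 + b1 * xi)
            g0 += err
            g1 += err * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


b0, b1 = fit_logistic(z, treated)

# The propensity score is the fitted probability of treatment for each shop.
propensity = [sigmoid(b0 + b1 * zi) for zi in z]
```

In this simulation the transformed covariate is far less skewed than the raw one, and the fitted scores fall strictly between 0 and 1, which is what the matching step needs as input.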

    It’s a Match!

    There are many ways we could match the participants on both sets—the choice of algorithm depends on the research objectives, desired analysis, and cost considerations. For the purpose of this study, we chose a caliper matching algorithm.

    A caliper matching algorithm is basically a greedy nearest-neighbor (NN) matching algorithm: starting from the largest score, it tries to find the closest match in the other set. It differs from a regular greedy NN algorithm in that it only allows matches within a certain threshold. The caliper defines the maximum distance allowed between matched scores. This is key because with an infinite caliper you’ll always find a neighbor, but that neighbor might be pretty far away. With a finite caliper, not all shops will necessarily find matches, but the matches we end up with will be fairly close. We followed Austin’s recommendation to choose our caliper width.

    After computing the caliper and running the greedy NN matching algorithm, we found a Canadian match for all but one US first-time Capital adopter.
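
The greedy procedure described above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used in the study: the shop ids and scores are made up, matching is 1:1 without replacement, and the caliper is set to 0.2 of the pooled standard deviation of the logit of the propensity score, which is one widely cited reading of Austin's recommendation (the exact width used in the study isn't stated here).

```python
import math
import statistics


def logit(p):
    return math.log(p / (1.0 - p))


def caliper_match(treated, control, caliper):
    """Greedy 1:1 nearest-neighbor matching without replacement.

    treated, control: dicts mapping a shop id to its logit propensity score.
    Returns {treated_id: control_id}. Treated shops with no control shop
    within the caliper stay unmatched.
    """
    available = dict(control)
    matches = {}
    # Start from the largest score and work downwards.
    ordered = sorted(treated.items(), key=lambda kv: kv[1], reverse=True)
    for t_id, t_score in ordered:
        if not available:
            break
        c_id = min(available, key=lambda c: abs(available[c] - t_score))
        if abs(available[c_id] - t_score) <= caliper:
            matches[t_id] = c_id
            del available[c_id]  # each control shop is used at most once
    return matches


# Hypothetical propensity scores for a handful of shops.
treated_ps = {"us_1": 0.80, "us_2": 0.55, "us_3": 0.30, "us_4": 0.97}
control_ps = {"ca_1": 0.78, "ca_2": 0.50, "ca_3": 0.33, "ca_4": 0.10, "ca_5": 0.65}

treated_logit = {k: logit(p) for k, p in treated_ps.items()}
control_logit = {k: logit(p) for k, p in control_ps.items()}

# Assumed caliper: 0.2 of the pooled standard deviation of the logit scores.
pooled_sd = math.sqrt(
    (statistics.variance(treated_logit.values())
     + statistics.variance(control_logit.values())) / 2
)
caliper = 0.2 * pooled_sd

matches = caliper_match(treated_logit, control_logit, caliper)
```

With these made-up scores, three of the four treated shops find a control shop within the caliper, while `us_4` has no close enough neighbor and is dropped, which mirrors how a treated shop can end up unmatched.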

    Matching Quality

    Before jumping to evaluate the impact of Capital, we need to determine the quality of our matching. We used the following three techniques to assess balance:

    1. Standardized mean differences: This method compares the mean of each covariate in the two groups, scaled by a standard deviation. A value close to zero indicates good balance. Several recommended thresholds have been published in the literature, with many authors recommending 0.1. We can visualize this using a “love plot,” like so:

      Love plot comparing feature absolute standardized differences before and after matching.
    2. Visual Diagnostics: Visual diagnostics such as empirical cumulative distribution function (eCDF) plots, quantile-quantile plots, and kernel density plots show exactly how the covariate distributions differ from each other (that is, where in the distribution the greatest imbalances are). We plot the distributions to check visually how they look before and after matching. Ideally, the distributions are superimposed on one another after matching.

      Propensity score plots before matching. Less overlap indicates fewer close matches between groups.
      Propensity score plots after matching - Increased overlapping indicating good matches between groups
    3. Variance Ratios: The variance ratio is the ratio of the variance of a covariate in one group to that in the other. Variance ratios close to 1 indicate good balance because they imply the variances of the samples are similar, whereas numbers close to 2 are sometimes considered extreme.

    Only one of our covariates reached the 0.1 threshold in the standardized mean differences method. Visual comparison (see above) showed great improvement and good alignment in covariate distributions for the matched sets. All of our variance ratios were below 1.3.
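
Two of these balance checks are simple enough to sketch directly. The snippet below is illustrative only: the covariate values are made up, and the standardized mean difference uses the pooled standard deviation of the two groups as its denominator, which is one common convention among several.

```python
import math
import statistics


def standardized_mean_difference(treated, control):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        (statistics.variance(treated) + statistics.variance(control)) / 2
    )
    return (statistics.fmean(treated) - statistics.fmean(control)) / pooled_sd


def variance_ratio(treated, control):
    """Variance of the covariate in one group over its variance in the other."""
    return statistics.variance(treated) / statistics.variance(control)


# Hypothetical log-transformed covariate for matched shops.
treated_cov = [10.2, 11.0, 9.8, 10.5, 10.9]
control_cov = [10.1, 11.1, 9.9, 10.4, 11.0]

smd = standardized_mean_difference(treated_cov, control_cov)
vr = variance_ratio(treated_cov, control_cov)

# Thresholds mentioned in the text: |SMD| below 0.1, variance ratio near 1.
balanced = abs(smd) < 0.1 and 0.5 < vr < 2.0
```

Running the same two functions on every covariate before and after matching gives the numbers that a love plot displays.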

    The checks presented cover most of the steps recommended in the literature for making sure the matching is suitable for further analysis. While we could go further and keep tweaking covariates and testing different methods until a perfect matching is achieved, that would risk introducing bias and wouldn’t guarantee the assumptions would be any stronger. So, we decided to proceed with assessing the treatment effect.

    How We Evaluated Impact

    At this point, the shops were matched, we had the counterfactual and treatment group, and we knew the matching was balanced. We’d come to the real question: Is Shopify Capital impacting their sales? What’s the difference in GMV between shops who did and didn’t receive Shopify Capital? 

    To assess the effect of the treatment, we set up a simple linear regression with a binary treatment indicator: y' = β₀ + β₁ * T.

    Here, T is a binary indicator that equals 1 for a treated US shop and 0 for a matched Canadian control shop, β₀ is the intercept of the regression, and β₁ is the coefficient that shows how being on treatment influences our target on average. The target, y', is the logarithm of one plus the cumulative six-month GMV from February to July 2019 (that is, the log1p transform of six-month sales).

    Using this methodology, we found that US merchants had a 36% higher geometric average of cumulative six-month GMV after taking Capital for the first time than their matched peers in Canada.
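
To see how a coefficient on a log target turns into a percentage, here is a small sketch with made-up GMV figures (they are not the study's data). With a single binary regressor, the least-squares slope β₁ equals the difference between the two group means of y', so exp(β₁) − 1 is the relative difference in the geometric mean of (1 + GMV).

```python
import math
import statistics

# Hypothetical six-month GMV for matched shops (made-up numbers).
gmv_treated = [1200.0, 5400.0, 800.0, 15000.0]
gmv_control = [1000.0, 4000.0, 700.0, 9000.0]

# T = 1 for treated shops, 0 for controls; y' = log1p(GMV).
t = [1] * len(gmv_treated) + [0] * len(gmv_control)
y = [math.log1p(g) for g in gmv_treated + gmv_control]


def fit_simple_ols(x, y):
    """Closed-form least squares for y = beta0 + beta1 * x."""
    x_bar, y_bar = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta1 = sxy / sxx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


beta0, beta1 = fit_simple_ols(t, y)

# Back-transform the log-scale effect into a relative lift.
lift = math.expm1(beta1)
```

A reported lift such as 36% would correspond to a β₁ of roughly ln(1.36) ≈ 0.31 on the log scale.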

    How Confident Are We in Our Estimated Treatment Effect? 

    In order to make sure we were confident in the treatment effect we calculated, we ran several robustness checks. We won’t get into the details, but we used the margins package, simulated an A/A test to validate our point estimate, and followed Greifer’s proposed method for bootstrapping.

    Cumulative geometric average of sales between groups before and after taking their first round of Capital

    Our results show that the 95% confidence interval for the average increase in the geometric mean of six-month GMV, after taking Capital for the first time, is between 13% and 65%. The most important takeaway is that the lower bound is positive, so we can say with high confidence that Shopify Capital has a positive effect on merchants’ sales.
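
For readers who want a feel for how such an interval can be produced, below is a generic percentile bootstrap over matched pairs. It is a sketch under assumptions, not a reproduction of Greifer's method or of the analysis above: the pairs are simulated, the function name `bootstrap_lift_ci` is hypothetical, and resampling whole pairs is just one simple way to respect the matching.

```python
import math
import random
import statistics


def bootstrap_lift_ci(pairs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the lift in geometric-mean (1 + GMV).

    pairs: list of (treated_gmv, control_gmv) tuples for matched shops.
    Returns (point_estimate, lower, upper) as relative lifts.
    """
    rng = random.Random(seed)
    # Per-pair difference of the log1p-transformed outcome.
    diffs = [math.log1p(t) - math.log1p(c) for t, c in pairs]
    point = math.expm1(statistics.fmean(diffs))
    lifts = []
    for _ in range(n_boot):
        # Resample matched pairs with replacement.
        sample = rng.choices(diffs, k=len(diffs))
        lifts.append(math.expm1(statistics.fmean(sample)))
    lifts.sort()
    lower = lifts[int((alpha / 2) * n_boot)]
    upper = lifts[int((1 - alpha / 2) * n_boot) - 1]
    return point, lower, upper


# Simulated matched pairs (assumption: a true log-scale effect of 0.3).
sim = random.Random(42)
pairs = []
for _ in range(300):
    control_gmv = math.exp(sim.gauss(9.0, 1.0))
    treated_gmv = control_gmv * math.exp(sim.gauss(0.3, 0.8))
    pairs.append((treated_gmv, control_gmv))

point, lower, upper = bootstrap_lift_ci(pairs)
```

The reading is the same as in the text: if the lower bound of the interval stays above zero, the data support a positive effect.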

    Final Thoughts

    With high statistical significance, backed by robustness checks, we concluded that the average difference in the geometric mean of GMV in the following six months after adopting Shopify Capital for the first time is +36%, bounded by +13% and +65%. We can now say with confidence that Shopify Capital does indeed help our merchants—and not only that, but it validates the work we’re doing as a data team. Through this study, we were able to prove that one of our first machine learning products has a significant real-world impact, making funding more accessible and helping merchants grow their businesses. We look forward to continuing to create innovative solutions that help our merchants achieve their goals.

    Breno Freitas is a Staff Data Scientist working on Shopify Capital Data and a machine learning researcher at Federal University of Sao Carlos, Brazil. Breno has worked with Shopify Capital for over four years and currently leads a squad within the team. Currently based in Ottawa, Canada, Breno enjoys kayaking and working on DIY projects in his spare time.

    Nevena Francetic is a Senior Data Science Manager for Money at Shopify. She’s leading teams that use data to power and transform financial products. She lives in Ottawa, Ontario and in her spare time she spoils her little nephews. To connect, reach her on LinkedIn.

    If you’re passionate about data discovery and eager to learn more, we’re always hiring! Reach out to us or apply on our careers page.