By Jodi Sloan and Selina Li
Over 2 million users visit Shopify’s Help Center every month to find help solving a problem. They come to the Help Center with different motives: learn how to set up their store, find tips on how to market, or get advice on troubleshooting a technical issue. Our search product helps users narrow down what they’re looking for by surfacing what’s most relevant for them. Algorithms empower search products to surface the most suitable results, but how do you know if they’re succeeding at this mission?
Below, we’ll share the three-step framework we built for evaluating new search algorithms. From collecting data using Kafka and annotation, to conducting offline and online A/B tests, we’ll share how we measure the effectiveness of a search algorithm.
Search is a difficult product to build. When you input a search query, the search product sifts through its entire collection of documents to find suitable matches. This is no easy feat as a search product’s collection and matching result set might be extensive. For example, within the Shopify Help Center’s search product lives thousands of articles, and a search for “shipping” could return hundreds.
We use an algorithm we call Vanilla Pagerank to power our search. It boosts results by the total number of views an article has across all search queries. The problem is that it also surfaces non-relevant results. For example, if you conducted a search for “delete discounts” the algorithm may prefer results with the keywords “delete” or “discounts”, but not necessarily results on “deleting discounts”.
We’re always trying to improve our users’s experience by making our search algorithms smarter. That’s why we built a new algorithm, Query-specific Pagerank, which aims to boost results with high click frequencies (a popularity metric) from historic searches containing the search term. It basically boosts the most frequently clicked-on results from similar searches.
The challenge is any change to the search algorithms might improve the results for some queries, but worsen others. So to have confidence in our new algorithm, we use data to evaluate its impact and performance against the existing algorithm. We implemented a simple three-step framework for evaluating experimental search algorithms.
1. Collect Data
Before we evaluate our search algorithms, we need a “ground truth” dataset telling us which articles are relevant for various intents. For Shopify’s Help Center search, we use two sources: Kafka events and annotated datasets.
Events-based Data with Kafka
Users interact with search products by entering queries, clicking on results, or leaving feedback. By using a messaging system like Kafka, we collect all live user interactions in schematized streams and model these events in an ETL pipeline. This culminates in a search fact table that aggregates facts about a search session based on behaviours we’re interested in analyzing.
With data generated from Kafka events, we can continuously evaluate our online metrics and see near real-time change. This helps us to:
- Monitor product changes that may be having an adverse impact on the user experience.
- Assign users to A/B tests in real time. We can set some users’s experiences to be powered by search algorithm A and others by algorithm B.
- Ingest interactions in a streaming pipeline for real-time feedback to the search product.
Annotation is a powerful tool for generating labeled datasets. To ensure high-quality datasets within the Shopify Help Center, we leverage the expertise of the Shopify Support team to annotate the relevance of our search results to the input query. Manual annotation can provide high-quality and trustworthy labels, but can be an expensive and slow process, and may not be feasible for all search problems. Alternatively, click models can build judgments using data from user interactions. However, for the Shopify Help Center, human judgments are preferred since we value the expert ratings of our experienced Support team.
The process for annotation is simple:
- A query is paired with a document or a result set that might represent a relative match.
- An annotator judges the relevance of the document to the question and assigns the pair a rating.
- The labels are combined with the inputs to produce a labelled dataset.
There are numerous ways we annotate search results:
- Binary ratings: Given a user’s query and a document, an annotator can answer the question Is this document relevant to the query? by providing a binary answer (1 = yes, 0 = no).
- Scale ratings: Given a user’s query and a document, a rater can answer the question How relevant is this document to the query? on a scale (1 to 5, where 5 is the most relevant). This provides interval data that you can turn into categories, where a rating of 4 or 5 represents a hit and anything lower represents a miss.
- Ranked lists: Given a user’s query and a set of documents, a rater can answer the question Rank the documents from most relevant to least relevant to the query. This provides ranked data, where each query-document pair has an associated rank.
The design of your annotation options depends on the evaluation measures you want to employ. We used Scale Ratings with a worded list (bad, ok, good, and great) that provides explicit descriptions on each label to simplify human judgment. These labels are then quantified and used for calculating performance scores. For example, when conducting a search for “shipping”, our Query-specific Pagerank algorithm may return documents that are more relevant to the query where the majority of them are labelled “great” or “good”.
One thing to keep in mind with annotation is dataset staleness. Our document set in the Shopify Help Center changes rapidly, so datasets can become stale over time. If your search product is similar, we recommend re-running annotation projects on a regular basis or augmenting existing datasets that are used for search algorithm training with unseen data points.
2. Evaluate Offline Metrics
After collecting our data, we wanted to know whether our new search algorithm, Query-specific Pagerank, had a positive impact on our ranking effectiveness without impacting our users. We did this by running an offline evaluation that uses relevance ratings from curated datasets to judge the effectiveness of search results before launching potentially-risky changes to production. By using offline metrics, we run thousands of historical search queries against our algorithm variants and use the labels to evaluate statistical measures that tell us how well our algorithm performs. This enables us to iterate quickly since we simulate thousands of searches per minute.
There are a number of measures we can use, but we’ve found Mean Average Precision and Normalized Discounted Cumulative Gain to be the most useful and approachable for our stakeholders.
Mean Average Precision
Mean Average Precision (MAP) measures the relevance of each item in the results list to the user’s query with a specific cutoff N. As queries can return thousands of results, and only a few users will read all of them, only top N returned results need to be examined. The top N number is usually chosen arbitrarily or based on the number of paginated results. Precision@N is the percentage of relevant items among the first N recommendations. MAP is calculated by averaging the AP scores for each query in our dataset. The result is a measure that penalizes returning irrelevant documents before relevant ones. Here is an example of how MAP is calculated:
Normalized Discounted Cumulative Gain
To compute MAP, you need to classify if a document is a good recommendation by determining a cutoff score. For example, document and search query pairs that have ok, good, and great labels (that is scores greater than 0) will be categorized as relevant. But the differences in the relevancy of ok and great pairs will be neglected.
Discounted Cumulative Gain (DCG) addresses this drawback by maintaining the non-binary score while adding a logarithmic reduction factor—the denominator of the DCG function to penalize the recommended items with lower positions in the list. DCG is a measure of ranking quality that takes into account the position of documents in the results set.
One issue with DCG is that the length of search results differ depending on the query provided. This is problematic because the more results a query set has, the better the DCG score, even if the ranking doesn’t make sense. Normalized Discounted Cumulative Gain Scores (NDCG) solves this problem. NDCG is calculated by dividing DCG by the maximum possible DCG score—the score calculated from the sorted search results.
Comparing the numeric values of offline metrics is great when the differences are significant. The higher value, the more successful the algorithm. However, this only tells us the ending of the story. When the results aren’t significantly different, the insights we get from the metrics comparison are limited. Therefore, understanding how we got to the ending is also important for future model improvements and developments. You gather these insights by breaking down the composition of queries to look at:
- Frequency: How often do our algorithms return worse results than annotation data?
- Velocity: How far off are our rankings in different algorithms?
- Commonalities: Understanding the queries that consistently have positive impacts on the algorithm performance, and finding the commonality among these queries, can help you determine the limitations on an algorithm.
Offline Metrics in Action
We conducted a deep dive analysis on evaluating offline metrics using MAP and NDCG to assess the success of our new Query-specific Pagerank algorithm. We found that our new algorithm returned higher scored documents more frequently and had slightly better scores in both offline metrics.
3. Evaluate Online Metrics
Next, we wanted to see how our users interact with our algorithms by looking at online metrics. Online metrics use search logs to determine how real search events perform. They’re based on understanding users’ behaviour with the search product and are commonly used to evaluate A/B tests.
The metrics you choose to evaluate the success of your algorithms depends on the goals of your product. Since the Shopify Help Center aims to provide relevant information to users looking for assistance with our platform, metrics that determine success include:
- How often users interact with search results
- How far they had to scroll to find the best result for their needs
- Whether they had to contact our Support team to solve their problem.
When running an A/B test, we need to define these measures and determine how we expect the new algorithm to move the needle. Below are some common online metrics, as well as product success metrics, that we used to evaluate how search events performed:
- Click-through rate (CTR): The portion of users who clicked on a search result when surfaced. Relevant results should be clicked, so we want a high CTR.
- Average rank: The average rank of clicked results. Since we aim for the most relevant results to be surfaced first, and clicked, we want to have a low average rank.
- Abandonment:When a user has intent to find help but they didn’t interact with search results, didn’t contact us, and wasn’t seen again. We always expect some level of abandonment due to bots and spam messages (yes, search products get spam too!), so we want this to be moderately low.
- Deflection: Our search is a success when users don’t have to contact support to find what they’re looking for. In other words, the user was deflected from contacting us. We want this metric to be high, but in reality we want people to contact us when it’s what is best for their situation, so deflection is a bit of a nuanced metric.
We use the Kafka data collected in our first step to calculate these metrics and see how successful our search product is over time, across user segments, or between different search topics. For example, we study CTR and deflection rates between users in different languages. We also use A/B tests to assign users to different versions of our product to see if our new version significantly outperforms the old.
A/B testing search is similar to other A/B tests you may be familiar with. When a user visits the Help Center, they are assigned to one of our experiment groups, which determines which algorithm their subsequent searches will be powered by. Over many visits, we can evaluate our metrics to determine which algorithm outperforms the other (for example, whether algorithm A has a significantly higher click-through rate with search results than algorithm B).
Online Metrics in Action
We conducted an online A/B test to see how our new Query-specific Pagerank algorithm measured against our existing algorithm, with half of our search users assigned to group A (powered by Vanilla Pagerank) and half assigned to group B (powered by Query-specific Pagerank). Our results showed that users in group B were:
- Less likely to click past the first page of results
- Less likely to conduct a follow-up search
- More likely to click results
- More likely to click the first result shown
- Had a lower average rank of clicked results
Essentially, group B users were able to find helpful articles with less effort when compared to group A users.
After using our evaluation framework to measure our new algorithm against our existing algorithm, it was clear that our new algorithm outperformed the former in a way that was meaningful for our product. Our metrics showed our experiment was a success, and we were able to replace our Vanilla Pagerank algorithm with our new and improved Query-specific Pagerank algorithm.
If you’re using this framework to evaluate your search algorithms, it’s important to note that even a failed experiment can help you learn and identify new opportunities. Did your offline or online testing show a decrease in performance? Is a certain subset of queries driving the performance down? Are some users better served by the changes than other segments? However your analysis points, don’t be afraid to dive deeper to understand what’s happening. You’ll want to document your findings for future iterations.
Key Takeaways for Evaluating a Search Algorithm
Algorithms are the secret sauce of any search product. The efficacy of a search algorithm has a huge impact on a users’ experience, so having the right process to evaluate a search algorithm’s performance ensures you’re making confident decisions with your users in mind.
To quickly summarize, the most important takeaways are:
- A high-quality and reliable labelled dataset is key for a successful, unbiased evaluation of a search algorithm.
- Online metrics provide valuable insights on user behaviours, in addition to algorithm evaluation, even if they’re resource-intensive and risky to implement.
- Offline metrics are helpful for iterating on new algorithms quickly and mitigating the risks of launching new algorithms into production.
Jodi Sloan is a Senior Engineer on the Discovery Experiences team. Over the past 2 years, she has used her passion for data science and engineering to craft performant, intelligent search experiences across Shopify. If you’d like to connect with Jodi, reach out on LinkedIn.
Selina Li is a Data Scientist on the Messaging team. She is currently working to build intelligence in conversation and improve merchant experiences in chat. Previously, she was with the Self Help team where she contributed to deliver better search experiences for users in Shopify Help Center and Concierge. If you would like to connect with Selina, reach out on Linkedin.
If you’re a data scientist or engineer who’s passionate about building great search products, we’re hiring! Reach out to us or apply on our careers page.