StatsD at Shopify

Here at Shopify, we like data. One of the many tools in our data toolbox is StatsD. We've been using StatsD in production at Shopify for many months now, consistently sending multiple events to our StatsD instance on every request.

What is StatsD good for?

In my experience, there are two things that StatsD really excels at. The first is getting a high-level overview of some custom piece of data. We use NewRelic to tell us about the performance of our apps. NewRelic provides a great overview of our performance as a whole, right down to which of our controller actions are slowest, and though it has an API for custom instrumentation, I've never used it. For custom metrics we use StatsD.

We use lots of memcached, and one metric we track with StatsD is cache hits vs. cache misses on our frontend. On every request that hits a cacheable action we send an event to StatsD to record a hit or miss. 
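Under the hood each of these events is just a tiny fire-and-forget UDP datagram in StatsD's plain-text protocol, which is why the per-request cost is negligible. A minimal sketch of what a client sends over the wire (the host and metric names here are placeholders, not our production config):

require 'socket'

sock = UDPSocket.new
# a counter event: increment 'Storefront.cache.hit' by 1
sock.send 'Storefront.cache.hit:1|c', 0, 'statsd.example.com', 8125
# a timing event: a render that took 45ms
sock.send 'Liquid.Template.render:45|ms', 0, 'statsd.example.com', 8125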

Caching Baseline (Green: cache hits, Blue: cache misses)

Note: The graphs in this article were generated by Graphite, the real-time graphing system that StatsD runs on top of.

As an example of how this is useful, we recently added some data to a cache key without properly converting it to a string, so that piece of the key was unique far more often than it should have been. The net result was more cache misses than usual. Looking at our NewRelic data we could see that performance was affected, but it was difficult to see exactly where: the response time from our memcached servers was still good, and the response time from the app was still good. Meanwhile our number of cache misses had doubled, our number of cache hits had halved, and overall user-facing performance was down.
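To illustrate the class of bug (a contrived sketch, not our actual code): interpolating an object that doesn't define a sensible to_s into a cache key pulls in its default string form, which includes the object id and therefore changes on every request:

class CurrencyFilter
  # oops: no to_s override, so interpolation falls back to the default
end

key = "products/42/#{CurrencyFilter.new}"
# => "products/42/#<CurrencyFilter:0x00007f...>"
# The key never repeats, so every lookup is a cache miss.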

A problem

It wasn't until we looked at our StatsD graphs that we fully understood the problem. Looking at our caching trends over time we could clearly see that on a specific date something was introduced that was affecting caching negatively. With a specific date we were able to track down the git commit and fix the issue. Keeping an eye on our StatsD graphs we immediately saw the behaviour return to the normal trend.

Return to Baseline

The second thing that StatsD excels at is testing assumptions. When we're writing code we're constantly making assumptions: about how our web app may be used, about how often an interaction will be performed, about how fast a particular operation may be, about how successful a particular operation may be. Using StatsD it becomes trivial to get real data about all of this.

For instance, we push a lot of products to Google Product Search on behalf of our customers. At one point I was seeing an abnormally high number of failures returned from Google when we were posting these products via their API. My first assumption was that something was wrong at the protocol level and most of our API requests were failing. I could have done some digging around in the database to get an idea of how many failures we were getting, cross-referenced with how many products we were trying to publish and how frequently, and so on. But using our StatsD client (see below) I was able to add a simple success/failure metric that gave me a high-level overview of the issue. Looking at the graph from StatsD I could see that my assumption was wrong, so I was able to eliminate that line of thinking.
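Conceptually, the instrumentation is just a pair of counters around the call. Hand-rolled, it would look something like this (a sketch of the pattern that statsd_count_success, shown below, wraps up for us):

require 'statsd-instrument'

begin
  GoogleBase.update_products!          # the API call under suspicion
  StatsD.increment 'GoogleBase.update_products.success'
rescue
  StatsD.increment 'GoogleBase.update_products.failure'
  raise                                # re-raise; we only want to count it
end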

statsd-instrument

We were excited about StatsD as soon as we read Etsy's announcement. We wrote our own client and began using it immediately. Today we're releasing that client. It's been in use in production since then and has been stalwartly collecting data for us. On an average request we're sending ~5 events to StatsD and we don't see a performance hit. We're actually using StatsD to record the raw number of requests we handle over time.

statsd-instrument provides some basic helpers for sending data to StatsD, but we don't typically use those directly. We didn't want to litter our application with instrumentation details, so we wrote metaprogramming methods that inject the instrumentation where it's needed. Using those methods we've kept all of our instrumentation contained to one file in our config/initializers folder. Check out the README for the full API, or pull down the statsd-instrument rubygem to use it.

A sample of our instrumentation shows how to use the library and the metaprogramming methods:

# Liquid
Liquid::Template.extend StatsD::Instrument
Liquid::Template.statsd_measure :parse, 'Liquid.Template.parse'
Liquid::Template.statsd_measure :render, 'Liquid.Template.render'

# Google Base
GoogleBase.extend StatsD::Instrument
GoogleBase.statsd_count_success :update_products!, 'GoogleBase.update_products'

# Webhooks
WebhookJob.extend StatsD::Instrument
WebhookJob.statsd_count_success :perform, 'Webhook.perform'

That being said, there are a few places where we do make use of the helpers directly (sans metaprogramming), still within the confines of our instrumentation initializer:

ShopAreaController.after_filter do
  # Sampled at 10% (the 0.1) since this fires on every storefront request;
  # the StatsD server scales sampled counters back up accordingly.
  StatsD.increment 'Storefront.requests', 1, 0.1

  # `next`, not `return`: returning from a filter block raises a LocalJumpError
  next unless request.env['cacheable.cache']

  if request.env['cacheable.miss']
    StatsD.increment 'Storefront.cache.miss'
  elsif request.env['cacheable.store'] == 'client'
    StatsD.increment 'Storefront.cache.hit_client'
  elsif request.env['cacheable.store'] == 'server'
    StatsD.increment 'Storefront.cache.hit_server'
  end
end

Today we're recording metrics on everything from the time it takes to parse and render Liquid templates, to how often our webhooks succeed, the performance of our search server, average response times from the many payment gateways we support, and the success/failure of user logins.
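Most of that goes through the metaprogramming helpers shown above, but the helpers also work on arbitrary blocks, not just methods. Timing a single gateway call directly would look something like this (the gateway and metric name are invented for illustration):

require 'statsd-instrument'

StatsD.measure 'Gateway.authorize' do
  gateway.authorize(amount, credit_card)
end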

As I mentioned, we have many tools in our data toolbox, and StatsD is a low-friction way to collect and inspect metrics. Check out statsd-instrument on GitHub.