I was recently profiling a production Shopify application server using `perf` and noticed a fair amount of time being spent in a particular function, `st_lookup`, which is used by Ruby’s MRI implementation for hash table lookups:
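A flat profile like this can be gathered with `perf` in sampling mode. The invocation below is illustrative rather than the exact one used; the sampling frequency, duration, and the `$RUBY_PID` placeholder are all assumptions:

```sh
# Sample the on-CPU stacks of a running Ruby process (replace $RUBY_PID)
# at 99 Hz for 30 seconds, then summarize time by symbol.
perf record -F 99 -p "$RUBY_PID" -- sleep 30
perf report --stdio --sort symbol
```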
Hash tables are used all over MRI, and not just for the `Hash` object; global variables, instance variables, classes, and the garbage collector all use MRI’s internal hash table implementation, `st_table`. Unfortunately, what this profile did not show were the callers of `st_lookup`. Is this some application code that has gone wild? Is this an inefficiency in the VM?
`perf` is a great sampling profiler for Linux — it’s low overhead and can safely be used in production. However, up until a few years ago, in order to use the call-graph feature of `perf` you had to recompile an application with `-fno-omit-frame-pointer` to get usable stack traces. This gives the compiler one less register to work with, but I believe the trade-off is worth it in most cases. As of Linux 3.7, `perf` now supports recording call graphs even when code is compiled without `-fno-omit-frame-pointer`. This works by recording a snapshot of stack memory on each sample. When analyzing the profile later with `perf report`, the stack data from each sample is combined with DWARF debugging information in order to build a call graph. This increases the amount of data included in the profile, but is a reasonable compromise when compared with having to recompile everything with `-fno-omit-frame-pointer`.
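Recording with DWARF-based unwinding looks roughly like the following; as before, the frequency, duration, and `$RUBY_PID` placeholder are illustrative:

```sh
# Record with DWARF-based stack unwinding: each sample includes a copy of
# the user stack, which perf unwinds later using debug info.
perf record --call-graph dwarf -F 99 -p "$RUBY_PID" -- sleep 30

# The report now includes call chains, so callers of hot functions are visible.
perf report --stdio
```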
Now when running `perf` and collecting call graphs, I was able to see the callers of `st_lookup`:
From this profile, we can see that a large percentage of time is being spent in `rb_method_entry_get_without_cache`. At first, I suspected this was being caused by global method cache invalidation or clearing. After using SystemTap on the `method__cache__clear` probe and seeing a relatively low count, it was clear that this was not the case. At this point, I started digging into the MRI source, trying to understand what exactly happens when a method is called and how method caching actually works. Looking through the MRI source shows that effectively the only place that `rb_method_entry_get_without_cache` is called is via `rb_method_entry`:
The idea here is that the global method cache will be checked first (via the `GLOBAL_METHOD_CACHE` macro), and if the method is found, the method entry will be returned from the cache. If the method is not in the cache, a more expensive method lookup needs to be performed. This involves walking the class hierarchy to find the requested method, which can cause multiple invocations of `st_lookup` to look up the method from a class method table.
The `GLOBAL_METHOD_CACHE` macro performs a basic hash lookup on the provided key to locate the entry in the global method cache. Looking at this code, I was surprised to see that the global method cache only had room for 2048 entries. Shopify is a large Rails application with millions of lines of Ruby (when you include gems), and defines hundreds of thousands of methods. A 2048-entry global method cache is nowhere near adequate for an application of our size.
Analyzing method cache hit rates
One of the first things I wanted to do was determine the existing method cache hit-to-miss ratio. One way of doing this would be to add tracing code to our MRI to log this data. This would require us to deploy a custom Ruby build to our app servers, which is not an ideal solution. There is a better way: `ftrace` user probes.
`ftrace` is a lightweight tracing framework that is built into the Linux kernel. One of the best features of `ftrace` is the ability to create dynamic user-mode event probes (uprobes). By adding a probe, `ftrace` will fire an event whenever a specified function is called in a process. Better yet, you can even fire an event on the execution of a specific instruction. This was all I needed to collect method cache hit-to-miss numbers.
To configure a uprobe, you need to specify the path to the binary and the address of the instruction you want to trace. How do you figure out the address of the instruction? Using `objdump`:
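For example, the binary's symbol tables can be searched for `rb_method_entry`; the path below is a placeholder for wherever your `ruby` binary lives:

```sh
# Dump the regular and dynamic symbol tables and find rb_method_entry.
# The first field of each matching line is the symbol's address.
objdump -tT /usr/local/bin/ruby | grep rb_method_entry
```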
The first field `objdump` prints is the address of the function. Using this, I can now create a probe:
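Uprobes are registered by writing to the `uprobe_events` file in tracefs and then enabling the resulting event. A sketch of what that looks like; the binary path, the `0x1aa150` offset, and the probe name `rme` are placeholders (the real offset is whatever `objdump` reported for `rb_method_entry`):

```sh
cd /sys/kernel/debug/tracing

# Register a uprobe named "rme" that fires on entry to rb_method_entry.
# Replace the path and 0x1aa150 with your binary and the address from objdump.
echo 'p:rme /usr/local/bin/ruby:0x1aa150' >> uprobe_events

# Enable the event and clear the trace buffer.
echo 1 > events/uprobes/rme/enable
echo > trace
```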
Now when I run Ruby, I’ll get an event every time `rb_method_entry` is called:
What this tells me is that running Ruby with a simple “hello world” caused `rb_method_entry` to be called 4441 times.
This is really cool! But what I originally wanted to figure out was the global method cache hit-to-miss ratio. By disassembling `rb_method_entry`, we can figure out what code paths are executed in both hit and miss cases, which we can use to set the appropriate probes:
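One way to pull out just that function's disassembly (the binary path is again a placeholder):

```sh
# Disassemble the binary and print only the body of rb_method_entry,
# from its label to the blank line that ends the function in objdump's output.
objdump -d /usr/local/bin/ruby | awk '/<rb_method_entry>:/,/^$/'
```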
If you look at the C implementation of the function, you can see how it maps to the generated assembly. The relevant bits here are:
- `0x00000000001aa16a` is the address that jumps to `rb_method_entry_get_without_cache`. This is the path we take on a cache miss.
- `0x00000000001aa194` is the address of the return instruction that we’ll execute on a cache hit. Because cache misses perform an unconditional jump to `rb_method_entry_get_without_cache`, we can infer that this instruction will only ever be executed on a hit.
With this information, we can set up our two probes, one for global method cache misses and one for hits:
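Registering the two probes follows the same pattern as before; the offsets are the two addresses identified above, while the binary path and the probe names `rme_hit`/`rme_miss` are placeholders:

```sh
cd /sys/kernel/debug/tracing

# Probe the unconditional jump to rb_method_entry_get_without_cache (cache miss)
# and the early return instruction (cache hit).
echo 'p:rme_miss /usr/local/bin/ruby:0x1aa16a' >> uprobe_events
echo 'p:rme_hit /usr/local/bin/ruby:0x1aa194' >> uprobe_events

echo 1 > events/uprobes/rme_miss/enable
echo 1 > events/uprobes/rme_hit/enable
echo > trace

# After letting the workload run, count how many times each probe fired.
grep -c rme_miss trace
grep -c rme_hit trace
```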
And here are the results running the simple “hello world” script:
This is what we wanted to see! So for a simple hello world script, there were 727 method cache misses and 3714 hits, giving us a hit ratio of about 84%. Now that we have a method of calculating the hit rate, let’s see what it looks like in production!
Rather than have to execute these commands every time I wanted to take a sample, I wrote a small shell script to do it for me:
The script takes the path to the `ruby` binary you want to profile and the number of seconds to collect data. After the number of seconds has elapsed, it will display the number of cache hits and misses, along with the hit rate.
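A minimal sketch of such a script is below. To keep it short, it takes the hit and miss offsets as extra arguments rather than locating them in the binary itself, and the probe names and tracefs path are assumptions:

```sh
#!/bin/sh
# Usage: trace_rb_method_cache.sh /path/to/ruby SECONDS HIT_ADDR MISS_ADDR
# Registers hit/miss uprobes on the given ruby binary, collects events for
# the given number of seconds, then prints hits, misses, and the hit rate.
RUBY=$1; DURATION=$2; HIT_ADDR=$3; MISS_ADDR=$4
T=/sys/kernel/debug/tracing

echo "p:rme_hit $RUBY:$HIT_ADDR"   >> $T/uprobe_events
echo "p:rme_miss $RUBY:$MISS_ADDR" >> $T/uprobe_events
echo 1 > $T/events/uprobes/rme_hit/enable
echo 1 > $T/events/uprobes/rme_miss/enable
echo > $T/trace

sleep "$DURATION"

echo 0 > $T/events/uprobes/rme_hit/enable
echo 0 > $T/events/uprobes/rme_miss/enable

HITS=$(grep -c rme_hit $T/trace)
MISSES=$(grep -c rme_miss $T/trace)
echo "hits: $HITS  misses: $MISSES"
echo "hit rate: $(echo "scale=2; 100 * $HITS / ($HITS + $MISSES)" | bc)%"

# Clean up: remove the (now disabled) probes.
echo "-:rme_hit"  >> $T/uprobe_events
echo "-:rme_miss" >> $T/uprobe_events
```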
So, how does this look in production? With the default method cache size of 2048 entries, we get about a 90% hit rate. This is pretty good, but according to the results I saw in `perf`, it could definitely be better.
Introducing RUBY_GLOBAL_METHOD_CACHE_SIZE
The global method cache size is configured as a compile-time `#define`, `GLOBAL_METHOD_CACHE_SIZE`. I thought it would be useful to be able to configure the method cache size at runtime, like you can do with garbage collector parameters (the `RUBY_GC_*` environment variables). To that end, I added a new environment variable, `RUBY_GLOBAL_METHOD_CACHE_SIZE`, that can be used to configure the global method cache size at process start time. This code has been committed upstream and is available in Ruby 2.2.
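Using it is just a matter of setting the variable in the process’s environment before Ruby boots. The value is a number of cache entries and should be a power of two; the value and script name below are only an example:

```sh
# 131072 entries = a 128K-entry global method cache.
RUBY_GLOBAL_METHOD_CACHE_SIZE=131072 ruby app.rb
```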
Now that I had a way to dynamically configure the method cache size, I could run some tests to see how Shopify performed using various cache sizes. I ran the `trace_rb_method_cache.sh` script on one of our production servers for 60 seconds with a few different cache sizes and collected the results:
| Cache Size   | Hits       | Misses    | Hit Rate |
|--------------|------------|-----------|----------|
| 2K (Default) | 10,227,750 | 1,114,933 | 90.17%   |
| 16K          | 11,697,582 | 516,446   | 95.77%   |
| 32K          | 13,596,388 | 425,919   | 96.96%   |
| 64K          | 14,263,747 | 324,277   | 97.78%   |
| 128K         | 13,590,568 | 289,948   | 97.91%   |
| 256K         | 11,532,080 | 275,162   | 97.67%   |
| 512K         | 12,403,522 | 256,329   | 97.98%   |
From these numbers, it looks like the hit rate begins to level off once the cache size reaches about 64K entries.
Now, if we take a `perf` profile before and after tuning, we can see some respectable results:
2K method cache
128K method cache
This change gives us a cycle savings of about 3%. Not bad for changing one configuration value!
Summary
Using the system-level profiling tools `perf` and `ftrace`, I was able to find a performance issue that would never have been visible with Ruby-level tooling. These tools are available on any modern Linux distribution, and I encourage you to experiment with them on your own application to see what kind of benefits can be had!