In this blog post, I’ll be introducing how Shopify is improving CRuby’s performance in Ruby 3.2 by optimizing the memory layout in the garbage collector through the Variable Width Allocation project.
Ruby’s Memory Structure and Limitations
Ruby is a garbage collected language. It automatically allocates memory when objects are created and releases the memory when it detects that an object is no longer used. This means that the garbage collector is involved in the whole lifecycle of objects, so optimizing the memory layout of the garbage collector has the potential to improve the overall performance of the Ruby interpreter.
If you’re interested in how the garbage collector works for Ruby 3.0 (which is still largely applicable today), check out my earlier article.
Object Allocation
All Ruby objects (with the exception of immediates like nil, true, false, small integers, and floats) live in 40 byte slots called RVALUE. This means that Ruby’s garbage collector only allocates memory of a single size. This is done for simplicity: the garbage collector only needs to manage chunks of memory of one size. However, the disadvantages are clear: different types of objects have vastly different memory characteristics. For example, arrays are dynamic and can change size at runtime, while classes are static in size. This means that any data that can’t fit in the RVALUE has to be allocated externally, such as through the system using malloc.
These slots are managed in Ruby heap pages, with each page being 16 KB in size. This results in each page having 409 slots. Ruby uses heap pages as a unit to manage metadata, such as a counter to keep track of how many live objects are on that page. It also improves allocation performance since allocating memory from the system comes with a performance cost. We want to minimize the number of times we allocate memory from the system by allocating large chunks at a time.
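The slot count per page follows directly from these sizes. As a back-of-the-envelope sketch (the real calculation in gc.c also subtracts space for the page header, so the true usable count is slightly lower):

```ruby
# Rough slot math for a pre-3.2 Ruby heap page. This ignores the
# per-page header that gc.c accounts for, so it's an approximation.
PAGE_SIZE = 16 * 1024 # 16 KB heap page
SLOT_SIZE = 40        # one RVALUE

slots_per_page = PAGE_SIZE / SLOT_SIZE
puts slots_per_page # => 409
```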
Limitations
Although managing only a single sized slot in the Ruby garbage collector simplifies the implementation and avoids issues such as memory fragmentation, there are several limitations it introduces. I’ll introduce the limitations here and later discuss how we’re solving each of them through our project, Variable Width Allocation.
Poor Data Locality
We often think of our systems as just having one level of memory, the main RAM. However, there are actually multiple tiers of memory inside of your CPU, called caches. Each level of cache is larger in capacity than the last, but also with higher latency and lower bandwidth. For example, on my computer, there’s only 384 KB of level 1 cache compared to 32 MB of level 3 cache, but the level 1 cache is much faster (1500 GB/s versus 500 GB/s) and of lower latency (0.9 ns versus 10 ns). The main RAM is much slower (50 GB/s) and has much higher latency (80 ns) compared to the caches.
To put these numbers in a human scale, if it were to take us one second to read data from the level 1 cache, reading the same data from RAM would make the CPU wait for about a minute and a half. So it’s clear that we want to use data in the caches rather than fetching from memory to improve performance.
Whenever a piece of data is fetched from main memory, the CPU fetches a whole cache line at a time and stores it in the caches. The size of the cache line varies by platform: it’s 64 bytes on x86 and 128 bytes on Apple silicon. When the cache is full, it evicts the least recently used entries to make space for new ones.
However, a 40 byte RVALUE doesn’t take advantage of the full cache line, meaning that some of the fetched memory goes unused. Additionally, since the RVALUE doesn’t provide enough space, objects may allocate memory in other locations, which increases the amount of space used in the caches.
Performance Overhead of malloc
Allocating memory from the system using malloc isn’t free: it comes with a performance overhead. By minimizing the number of malloc calls, we can decrease this overhead, improving boot time and speeding up object creation.
Inefficient Use of Memory
Since different data types in Ruby have vastly different memory characteristics, forcing all objects into a 40 byte slot means that some objects can’t take advantage of all the space, while other objects that require more space need to store pointers to other regions of memory. Storing pointers requires additional space, and some implementations of malloc require additional memory to store metadata about the allocated memory, both of which increase the amount of memory used.
Limitations to Extensibility
Other languages like Java have support for multiple types of garbage collectors. This allows researchers to experiment with and develop new garbage collection techniques. Additionally, it allows users to change garbage collectors depending on their use case (for example, lower memory usage, faster boot time, or faster performance). In Ruby, since all objects are the same size and most of the data for objects is externally allocated through malloc, it’s difficult for other garbage collector implementations to perform any optimizations.
In the Enabling Cutting-Edge Research section, you’ll see an example of how this work enables academic researchers to perform cutting-edge garbage collector research on Ruby.
Variable Width Allocation
Variable Width Allocation is a feature authored by Matt Valentine-House, Aaron Patterson, and myself. The goal of Variable Width Allocation is to introduce APIs to allocate dynamically sized objects and add support for variable sized slots inside the Ruby garbage collector. Having variable sized slots allows types with different memory characteristics to allocate the size that’s most optimal for them.
Size Pools
With Variable Width Allocation, we introduce a new concept into the garbage collector called “size pools.” Each size pool has slots of a particular size. By keeping fixed sized slots in the size pool, we continue to avoid issues with memory fragmentation and keep allocations fast.
We currently have five size pools, with slot sizes that are power-of-two multiples of the RVALUE size: 40 bytes, 80 bytes, 160 bytes, 320 bytes, and 640 bytes. We chose power-of-two multiples since this ensures 75 percent utilization on average, and it’s more performant because it avoids frequent resizing when an object expands or shrinks. We’ve verified this hypothesis by performing memory dumps of Ruby processes and determined that around 75 percent of the memory in the slots was utilized. We chose multiples of the RVALUE size so that we can fit a single RVALUE without wasting any memory for types that aren’t yet on Variable Width Allocation.
Another solution we considered was to create a special size pool of RVALUE size, which would allow the other size pools to have slot sizes that are powers of two (32 bytes, 64 bytes, 128 bytes, etc.). However, we determined that the small efficiency gains weren’t worth the extra complexity of having a special size pool.
Heap Page Size Increase
As part of this work, we increased the heap page size from 16 KB to 64 KB. We did this because heap pages with larger slot sizes (especially the 640 byte slot) didn’t hold many slots per page (around 25). With so few slots per page, more pages needed to be allocated, and the frequent page allocations hurt performance. Additionally, since each heap page has a header used to store metadata, allocating more pages also meant allocating more metadata, increasing memory overhead.
However, to increase page sizes, we had to decouple page size from how the garbage collector worked because otherwise changing the page size would affect performance. Increasing page sizes to 64 KB also improves performance because it decreases the number of page allocations. It also has another advantage: on systems with 64 KB system memory pages (for example, PowerPC), it allows the Ruby page size to be equal to the system page size. By having Ruby page sizes that are equal to the system page size, additional features (for example, compaction) can be used on those systems.
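The slots-per-page arithmetic shows why the larger pages help for the biggest slots (header overhead is ignored here, so treat these as approximations):

```ruby
# Approximate slots per page for the largest (640 byte) size pool,
# ignoring the per-page header.
SLOT_SIZE = 640

puts (16 * 1024) / SLOT_SIZE # => 25  slots on a 16 KB page
puts (64 * 1024) / SLOT_SIZE # => 102 slots on a 64 KB page
```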
Resizing and Compaction
Up until now, we’ve only discussed allocation of objects through Variable Width Allocation. However, we know that objects can change in size during runtime. Objects could be resized upwards to the point where it doesn’t fit in the slot or could be resized downwards to the point where a significant amount of memory is wasted. So how do we deal with these two cases? For both, we currently rely on the compaction feature of the garbage collector to bring us to the most optimal state.
Compaction is a phase during garbage collection that moves objects close together to reduce fragmentation. We modified compaction to move objects to the most optimal size pool. For objects that need to grow, we first fall back to allocating through the system using malloc until compaction brings us back to the most optimal state by placing the contents adjacent to the object again, regaining data locality.
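Compaction can also be triggered manually. GC.compact (available since Ruby 3.0) runs a full garbage collection with compaction and returns statistics about the objects that were considered and moved; the exact keys vary by Ruby version, and some platforms don’t support compaction at all:

```ruby
# Manually trigger a compacting GC. On platforms where compaction
# is unsupported this raises NotImplementedError.
begin
  info = GC.compact
  p info.keys # e.g. includes :considered and :moved, depending on version
rescue NotImplementedError
  puts "compaction not supported on this platform"
end
```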
Types
There are currently four types covered by Variable Width Allocation: classes, arrays, strings, and objects. I’ll talk about how these are implemented.
Classes
Classes were the first type we introduced to Variable Width Allocation. We chose classes first because they’re of a fixed size, so we know the exact size at allocation time, and so we could implement Variable Width Allocation without needing to implement any logic for resizing.
For classes, we now store the rb_classext_struct adjacent to the class object. The rb_classext_struct stores data for the class, including instance variables, the superclass, and constants. Before Variable Width Allocation, since the rb_classext_struct was too big to fit in the RVALUE, it was separately allocated using malloc when a class was created. Since it’s now allocated along with the class object, we no longer need to allocate any memory from the system when a class is allocated.
Strings
Strings were the second type we introduced to Variable Width Allocation. We chose strings because they’re dynamic in size and a common type in Ruby programs. Strings already supported storing their contents adjacent to the object, referred to as “embedded.” However, since the RVALUE was previously limited to 40 bytes, only strings up to 23 bytes could be embedded. After Variable Width Allocation, strings up to 615 bytes can be embedded. Strings that are too large still have their contents stored in memory allocated from the system using malloc. We saw significant speedups in microbenchmarks (up to a 35 percent speedup).
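You can observe the embedding threshold indirectly with ObjectSpace.memsize_of: an embedded string reports roughly its slot size, while a large string’s size includes its external buffer. The exact numbers depend on the Ruby version, so the comments below are indicative only:

```ruby
require 'objspace'

short = "hello"       # small enough to be embedded in its slot
long  = "x" * 10_000  # contents stored in a separate malloc'd buffer

puts ObjectSpace.memsize_of(short) # roughly the slot size
puts ObjectSpace.memsize_of(long)  # slot size plus the external buffer
```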
Arrays
We then implemented Variable Width Allocation for arrays. Just like strings, there were already embedded arrays, but they were limited to three elements or fewer. After Variable Width Allocation, arrays up to 78 elements in length can be embedded. Arrays that are too large still have their contents stored in memory allocated from the system using malloc.
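The same memsize_of comparison works for arrays: a small array fits entirely in its slot, while a large one carries an external element buffer (exact values vary by Ruby version):

```ruby
require 'objspace'

small = [1, 2, 3]            # few enough elements to be embedded
large = Array.new(1_000, 0)  # backing store allocated via malloc

puts ObjectSpace.memsize_of(small) # roughly just the slot
puts ObjectSpace.memsize_of(large) # slot plus the external element buffer
```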
Objects
The last type on Variable Width Allocation is objects. To avoid confusion: even though everything in Ruby is an object, we’re talking specifically about objects created from classes defined in your code (in other words, not one of the core types in Ruby such as arrays, classes, strings, hashes, and integers).
Objects store a list of their instance variables, so we chose to implement that on Variable Width Allocation. Just like arrays, there were already embedded objects, but they were limited to three instance variables or fewer. After Variable Width Allocation, objects with up to 78 instance variables can be embedded. Objects with more than 78 instance variables still have their contents stored in memory allocated from the system using malloc. We saw significant speedups in microbenchmarks when reading and writing instance variables (around a 20 percent speedup).
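Here too, memsize_of makes the difference visible. The class names below are made up for illustration; the reported sizes vary by Ruby version, but an object with many instance variables always reports more memory than one whose variables fit in its slot:

```ruby
require 'objspace'

# Narrow has only two instance variables, so they fit in its slot.
class Narrow
  def initialize
    @a = 1
    @b = 2
  end
end

# Wide has far more than 78 instance variables, so its instance
# variable table spills out to memory allocated via malloc.
class Wide
  def initialize
    100.times { |i| instance_variable_set(:"@v#{i}", i) }
  end
end

puts ObjectSpace.memsize_of(Narrow.new)
puts ObjectSpace.memsize_of(Wide.new) # larger: includes the external table
```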
Enabling Cutting-Edge Research on Ruby
The Variable Width Allocation project enables academic researchers to experiment with new memory layouts and garbage collection techniques by implementing the Memory Management Toolkit (MMTk) project inside of Ruby. MMTk is a cutting-edge garbage collector research project from Professor Steve Blackburn at the Australian National University. It’s exciting because MMTk provides a standard set of memory management APIs that will allow us to try different garbage collection algorithms without modifying the Ruby source code. Different garbage collectors have different characteristics, allowing users to switch depending on their use case. For example, the mark-and-sweep garbage collector (the one currently in CRuby) optimizes for lower memory usage, while a semispace copying garbage collector optimizes for better performance (at the cost of using twice as much memory).
This research is partially sponsored by Shopify and you can learn more about the various Ruby research projects Shopify is sponsoring from our post, Shopify Invests in Research for Ruby at Scale.
Benchmark Results
We can see performance improvements in almost all benchmarks, with significant improvements in workloads that are heavy on reading and writing to data structures, such as parsing YAML files (psych-load) or generating PDF files (hexapdf). In railsbench, we can see a respectable 2.1 percent increase in performance.
You can find the benchmarking setup and the raw data in the Appendix.
Statistics Collection
Variable Width Allocation introduces a new API to collect statistics from the garbage collector. GC.stat_heap is a new method that returns statistics about the size pools. This method is useful for determining the memory characteristics of your Ruby program or debugging whether a particular size pool is causing performance issues. Since this method returns internal information about the garbage collector, the structure of the data returned is subject to change. Refer to the method documentation for more information.
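For example, on Ruby 3.2 you can inspect each size pool’s slot size (the key names are internal and may change between versions, so the snippet guards on the method’s availability):

```ruby
# Print the slot size of each size pool. GC.stat_heap is new in
# Ruby 3.2; key names like :slot_size are internal and may change.
if GC.respond_to?(:stat_heap)
  GC.stat_heap.each do |pool, stats|
    puts "size pool #{pool}: slot_size=#{stats[:slot_size]}"
  end
end
```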
The Escape Hatch
If you run into performance or compatibility issues (particularly with native gems), you can disable Variable Width Allocation by passing CPPFLAGS='-DUSE_RVARGC=0' to the configure step when building Ruby. Please also let us know about the issue by filing a bug report at bugs.ruby-lang.org. Note that this escape hatch is only available in Ruby 3.2, and support will be dropped in the future.
Appendix
Benchmark Setup
Benchmarking was done on a Ubuntu 22.04.1 machine with an AMD Ryzen 5 3600X and 32 GB of RAM. Ruby was benchmarked on a development version of 3.2.0 (commit 80e56d1).
All benchmarks were run using yjit-bench (with YJIT disabled to test interpreter performance) at commit 0cf6940.
Raw Benchmark Results
All benchmark results are in milliseconds for a single iteration, so lower is better.
| Benchmark | VWA enabled | VWA disabled | Speedup |
|---|---|---|---|
| activerecord (ms) | 121 | 124 | 1.025x |
| hexapdf (ms) | 2206 | 2322 | 1.053x |
| liquid-render (ms) | 142 | 145 | 1.021x |
| mail (ms) | 121 | 119 | 0.983x |
| psych-load (ms) | 1791 | 1967 | 1.098x |
| railsbench (ms) | 1782 | 1820 | 1.021x |
| ruby-lsp (ms) | 60 | 64 | 1.067x |
Peter is a Ruby core committer and Senior Developer at Shopify. He’s currently working on improving the performance of Ruby by optimizing the memory layout through the Variable Width Allocation project. He’s the author of ruby_memcheck, a gem used to find memory leaks in native gems. It’s found memory leaks in popular gems such as Nokogiri, protobuf, gRPC, and liquid-c.