Finding Relationships Between Ruby’s Top 100 Packages and Their Dependencies

by Kevin Lin
Development

Oct 19, 2022
9 minute read

In June of this year, RubyGems, the main repository for Ruby packages (gems), announced that multi-factor authentication (MFA) was going to be gradually rolled out to users. This means that users will eventually need to log in with a one-time password from their authenticator device, which will drastically reduce account takeovers.

The team I'm interning on, the Ruby Dependency Security team at Shopify, played a big part in rolling out MFA to RubyGems users. The team’s mission is to increase the security of the Ruby software supply chain, so increasing MFA usage is something we wanted to help implement.

A large Ruby with stick arms and legs pats a little Ruby with stick arms and legs (illustration by Kevin Lin)

One interesting decision that the RubyGems team faced was determining who would be included in the first milestone. The team wanted to include at least the top 100 RubyGems packages, but also wanted to prevent packages (and people) from falling out of this cohort in the future.

To meet those criteria, the team set a threshold of 180 million downloads for the gems instead. Once a gem crosses 180 million downloads, its owners are required to use multi-factor authentication in the future.

Bar graph showing gem download numbers for Gem 1 and Gem 2 — Gem downloads represented as bars. Gem 2 is over the 180M download threshold, so its owners would need MFA.

This design decision made me curious. Packages frequently depend on other packages, so could some of these big packages (more than 180M downloads) depend on small packages (fewer than 180M downloads)? If so, there would be a small loophole: a hacker who wanted to maximize their reach in the Ruby ecosystem could target one of these small packages (which would get installed every time someone installed one of the big packages), circumventing the MFA protection of the big packages.

On the surface, it might not make sense that a dependency would ever have fewer downloads than its parent. After all, every time the parent gets downloaded, the dependency does too, so surely the dependency has at least as many downloads as the parent, right?

Screenshot of a Slack conversation between coworkers discussing one's scepticism about finding exceptions — My coworker Jacques, doubting that big gems will rely on small gems. He tells me he finds this hilarious in retrospect.

Well, I thought I should try to find exceptions anyway, and given that this blog post exists, it would seem that I found some. Here’s how I did it.

The Investigation

The first step in determining if big packages depended on small packages was to get a list of big packages. The rubygems.org stats page shows the top 100 gems in terms of downloads, but the last gem on page 10 has 199 million downloads, meaning that scraping these pages would yield an incomplete list, since the threshold I was interested in is 180 million downloads.

A screenshot of a page of Rubygems.org statistics — Page 10 of https://rubygems.org/stats, just a bit above the MFA download threshold

To get a complete list, I instead turned to using the data dumps that rubygems.org makes available. Basically, the site takes a daily snapshot of the rubygems.org database, removes any confidential information, and then publishes it. Their repo has a convenient script that allows you to load these data dumps into your own local rubygems.org database, and therefore run queries on the data using the Rails console. It took me many tries to make a query that got all the big packages, but I eventually found one that worked:

Rubygem.joins(:gem_download).where(gem_download: {count: 180_000_000..}).map(&:name)

I now had a list of 112 big gems, and I had to find their dependencies. The first method I tried was using the rubygems.org API. As described in the documentation, you can give the API the name of a gem and it’ll give you the name of all of its dependencies as part of the response payload. The same endpoint of this API also tells you how many downloads a gem has, so the path was clear: for each big gem, get a list of its dependencies and find out if any of them had fewer downloads than the threshold.

Here are the functions that get the dependencies and downloads:

Ruby function that gets a list of dependencies as reported by the rubygems.org API. Requires built-in uri, net/http, and json packages.

Ruby function that gets downloads from the same rubygems.org API endpoint. Also has a branch to check the download count for specific versions of gems, that I later used.
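As a rough sketch (not the original code), the two lookups and the threshold check could look something like this. It assumes the public `https://rubygems.org/api/v1/gems/<name>.json` endpoint, whose payload includes a total `downloads` count and a `dependencies` object with a `runtime` list. The names `fetch_gem_info`, `runtime_dependency_names`, `small_direct_dependencies`, and `MFA_THRESHOLD` are hypothetical, and the per-version branch mentioned above is left out.

```ruby
require "json"
require "net/http"
require "uri"

# Assumed constant: the 180 million download threshold described in the post.
MFA_THRESHOLD = 180_000_000

# Fetches the public metadata for one gem and returns it as a Hash.
def fetch_gem_info(name)
  uri = URI("https://rubygems.org/api/v1/gems/#{name}.json")
  response = Net::HTTP.get_response(uri)
  raise "lookup failed for #{name}: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Names of the direct runtime dependencies listed in a gem's payload.
def runtime_dependency_names(gem_info)
  gem_info.fetch("dependencies", {}).fetch("runtime", []).map { |dep| dep["name"] }
end

# Direct dependencies whose total downloads fall below the threshold.
# `fetch` is injectable so the logic can be exercised without network access.
def small_direct_dependencies(name, threshold: MFA_THRESHOLD, fetch: method(:fetch_gem_info))
  runtime_dependency_names(fetch.call(name)).select do |dep_name|
    fetch.call(dep_name).fetch("downloads") < threshold
  end
end
```

Calling `small_direct_dependencies` once per big gem would then give the kind of list described next. A real run makes one request per gem and per dependency, so caching the responses would be a sensible addition.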

Putting all of this together, I found that 13 out of the 112 big gems had small gems as dependencies. Exceptions! So why did these small gems have fewer downloads than their parents? I learned that it was mainly due to two reasons:

Some gems are newer than their parents, that is, a new gem came out and a big gem developer wanted to add it as a dependency.
Some gems are shipped with Ruby by default, so they don’t need to be downloaded and thus have low(er) download count (for example, racc and rexml).

With this, I now had proof of the existence of big gems that would be indirectly vulnerable to account takeover of a small gem. While an existence proof is nice, it was pointed out to me that the rubygems.org API only returns a list of the direct dependencies of a gem, and that those dependencies might have sub-dependencies that I wasn’t checking. So how could I find out which packages get installed when one of these big gems gets installed?

With Bundler, of course!

Bundler is the Ruby dependency manager software that most Ruby users are probably familiar with. Bundler takes a list of gems to install (the Gemfile), installs dependencies that satisfy all version requirements, and, crucially for us, makes a list of all those dependencies and versions in a Gemfile.lock file. So, to find out which big gems relied in any way on small gems, I programmatically created a Gemfile with only the big gem in it, programmatically ran bundle lock, and programmatically read the Gemfile.lock that was created to get all the dependencies.

Here’s the function that did all the work with Bundler:

Ruby function that gets all dependencies that get installed when one gem is installed using Bundler
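The Bundler approach described above can be sketched roughly as follows. This is an illustrative reconstruction rather than the original function: it writes a one-gem Gemfile into a temporary directory, shells out to `bundle lock`, and reads the resulting Gemfile.lock with `Bundler::LockfileParser`. The method names `locked_gem_names` and `resolved_dependency_names` are hypothetical, and the second one needs the `bundle` executable and network access.

```ruby
require "bundler"
require "tmpdir"

# Extracts every gem name recorded in the text of a Gemfile.lock.
def locked_gem_names(lockfile_contents)
  Bundler::LockfileParser.new(lockfile_contents).specs.map(&:name).uniq
end

# Resolves one gem in a throwaway directory and returns the names of
# everything Bundler would install alongside it.
def resolved_dependency_names(gem_name)
  Dir.mktmpdir do |dir|
    File.write(File.join(dir, "Gemfile"), <<~GEMFILE)
      source "https://rubygems.org"
      gem "#{gem_name}"
    GEMFILE

    # Run outside any surrounding bundle so only the temporary Gemfile is used.
    locked = Bundler.with_unbundled_env do
      system("bundle", "lock", chdir: dir, out: File::NULL)
    end
    raise "bundle lock failed for #{gem_name}" unless locked

    locked_gem_names(File.read(File.join(dir, "Gemfile.lock"))) - [gem_name]
  end
end
```

Splitting the lockfile parsing from the resolution step keeps the parsing testable on a saved Gemfile.lock without touching the network.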

With this new methodology, I found that 24 of the 112 big gems rely on small gems, which is a fairly significant proportion of them. After discovering this, I wanted to look into visualization. Up until this point, I was just printing out results to the command line to make text dumps like this:

Text dump of dependency results. Big gems are red, their dependencies that are small are indented in black

This visualization isn’t very convenient to read, and it misses out on patterns. For example, as you can see above, many big gems rely on racc. It would be useful to know if they relied directly on it, or if most packages depended on it indirectly through some other package. The idea of making a graph was in the back of my mind since the beginning of this project, and when I realized how helpful it might be, I committed to it. I used the graph gem, following some examples from this talk by Aja Hammerly. I used a breadth-first search, starting with a queue of all the big gems, adding direct dependencies to the queue as I went. I added edges from gems to their dependencies and highlighted small gems in red. Here was the first iteration:

The output of the graph gem that highlights gem dependencies — The first iteration
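The breadth-first construction described above can be sketched in plain Ruby. This is not the code behind the image: instead of the graph gem it emits Graphviz DOT text directly, and `dependency_edges`, `to_dot`, and the `direct_deps` callable are hypothetical names chosen for illustration.

```ruby
require "set"

# Breadth-first walk from the big gems, recording one edge per direct dependency.
# `direct_deps` is any callable mapping a gem name to its dependency names.
def dependency_edges(big_gems, direct_deps)
  queue = big_gems.dup
  visited = Set.new(big_gems)
  edges = []

  until queue.empty?
    gem_name = queue.shift
    direct_deps.call(gem_name).each do |dep|
      edges << [gem_name, dep]
      # Set#add? returns nil when the gem was already seen, so each gem is queued once.
      queue << dep if visited.add?(dep)
    end
  end
  edges
end

# Renders the edges as Graphviz DOT text, marking small gems in red.
def to_dot(edges, small_gems)
  lines = ["digraph gems {"]
  small_gems.each { |name| lines << %(  "#{name}" [color=red];) }
  edges.each { |from, to| lines << %(  "#{from}" -> "#{to}";) }
  lines << "}"
  lines.join("\n")
end
```

The DOT text can be written to a file and rendered with Graphviz. Filtering the edge list before rendering is one way to drop a noisy family of gems from the picture.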

It turns out there are a lot of AWS gems, so I decided to remove them from the graph and got a much nicer result:

The graph, while moderately cluttered, shows a lot of information succinctly. For instance, you can see a galaxy of gems in the middle-left, with rails being the gravitational attractor, a clear keystone in the Ruby world.

Output of the gem graph with Rails at the center — The Rails galaxy

The node with the most arrows pointing into it is activesupport, so it really is an active support.

A close up view of activesupport in the output of the gem graph. activesupport has many arrows pointing into it. — 14 arrows pointing into activesupport

Racc, despite appearing in my printouts as a small gem for many big gems, is a direct dependency of only nokogiri.

A close up view of racc in the output of the gems graph — racc only has 1 edge attached to it

With this nice graph created, I followed up and made one final printout. This time, whenever I found a big gem that depended on a small gem, I printed out all the paths on the graph from the big gem to the small gem, that is, all the ways that the big gem relied on the small gem.

Here’s an example printout:

Big gem is in green (googleauth), small gems are in purple, and the black lines are all the paths from the big gem to the small gem.

I achieved this by making a directed graph data type and writing a depth-first search algorithm to find all the paths from one node to another. I chose to create my own data type because, from what I could tell, finding all paths on a graph isn’t already implemented in any Ruby gem. Here’s the algorithm, if you’re interested (`@graph` is a Hash of `String:Array` pairs, essentially an adjacency list):

Recursive depth-first search to find all paths from start to end
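A minimal sketch of such a search, assuming the adjacency-list shape of `@graph` described above, might look like this. The class name `DependencyGraph` and the methods `add_edge` and `all_paths` are hypothetical, and the original implementation may differ in its details.

```ruby
# A small directed graph backed by an adjacency list (String => Array of String).
class DependencyGraph
  def initialize
    @graph = Hash.new { |hash, key| hash[key] = [] }
  end

  # Adds a directed edge, ignoring duplicates.
  def add_edge(from, to)
    @graph[from] << to unless @graph[from].include?(to)
    @graph[to] # touch the key so leaf nodes are present as well
    self
  end

  # Returns every simple path from `start` to `finish` as an array of node names.
  def all_paths(start, finish, path = [])
    path += [start] # builds a new array, so sibling branches do not share state
    return [path] if start == finish

    # Skip nodes already on the current path to guard against cycles.
    unvisited = @graph[start].reject { |neighbour| path.include?(neighbour) }
    unvisited.flat_map { |neighbour| all_paths(neighbour, finish, path) }
  end
end
```

Each recursive call extends a copy of the current path, so the result is a flat list of complete paths that can be printed one per line.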

What’s Next

In summary, I found four ways to answer the question of whether or not big gems rely on small gems:

1. direct dependency printout (using the rubygems.org API)
2. sub-dependency printout (using Bundler)
3. graph (using the graph gem)
4. sub-dependency printout with paths (method 2, using my own graph data type)

I’m happy with my work, and I’m glad I got to learn about file I/O and use graph theory. I’m still relatively new to Ruby, so offshoot projects like these are very didactic.

The question remains of what to do with the 24 technically insecure gems. One proposal is to do nothing, since everyone will eventually need to have MFA enabled, and account takeover is still an uncommon event despite being on the rise.

Another option is to enforce MFA on these specific gems as a sort of blocklist, just to ensure the security of the top gems sooner. This would mean a small group of owners would have to enable MFA a few months earlier, so I could see this being a viable option.

Either way, more discussion with my team is needed. Thanks for reading!

Kevin is an intern on the Ruby Dependency Security team at Shopify. He is in his 5th year of Engineering Physics at the University of British Columbia.

