Shopify has grown significantly over the years, and our success makes us an attractive target for malicious actors. We take the safety of our merchants seriously, so we have a good reason to continuously improve the security at Shopify.

I’ll share how the Ruby Conventions team, which focuses on creating conventions to make Ruby services sustainable, used an iterative approach to solve complex problems at scale while responding to shifting circumstances. In particular, I’ll cover how we solved the dependency confusion vulnerability in over 600 Ruby applications, developed tooling that allows us to run large-scale migrations with ease, and made the Ruby community a bit safer.

Understanding the Dependency Confusion Problem

Shopify runs a bug bounty program where we pay people to find vulnerabilities on our platform and learn what we have to improve on. One such report showed that we were exposed to a dependency confusion vulnerability that could give an attacker access to our local, continuous integration/continuous deployment (CI/CD), and production environments.

The vulnerability leverages the ambiguity of a package source to install malicious dependencies. If an external package is created with a higher version number under the same name as an internal Shopify package, the external dependency is resolved instead of the internal dependency.
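To make the ambiguity concrete, here is a minimal sketch of the failure mode. It is a simplified model and not Bundler's actual resolver; the gem name `internal-billing`, the source URLs, and the helper names are hypothetical. It compares a resolver that ignores the source with one that only considers the source a gem is pinned to.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# Hypothetical candidates for one gem name, gathered from two sources.
Candidate = Struct.new(:name, :version, :source)

# Source-agnostic choice: the highest version wins, wherever it comes from.
def naive_pick(candidates)
  candidates.max_by { |c| Gem::Version.new(c.version) }
end

# Source-aware choice: only versions from the pinned source are considered.
def pinned_pick(candidates, pinned_source)
  candidates.select { |c| c.source == pinned_source }
            .max_by { |c| Gem::Version.new(c.version) }
end

candidates = [
  Candidate.new("internal-billing", "1.4.0", "https://gems.internal.example/"),
  Candidate.new("internal-billing", "99.0.0", "https://rubygems.org/"),
]

puts naive_pick(candidates).source
puts pinned_pick(candidates, "https://gems.internal.example/").version
```

In this model, the source-agnostic choice selects the externally published `99.0.0`, while the pinned choice stays on the internal `1.4.0`.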

In Ruby, developers use Bundler to manage their dependencies and make their environments reproducible. Bundler resolves dependencies so that you use the correct versions and sources for each gem. The Bundler team fixed the issue by introducing a new Gemfile.lock file format that’s created by a fresh install or an update. The new format assigns each gem to an explicit source.
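As an illustration of the difference, the sketch below embeds two lockfile fragments: one where a single GEM section lists several remotes, and one where each remote has its own GEM section. The gem names and URLs are hypothetical, the fragments are abbreviated, and the helper `ambiguous_sources?` is an assumption for illustration rather than anything Bundler provides.

```ruby
# Older style: one GEM section, several remotes, so the source of each gem is ambiguous.
OLD_FORMAT = <<~LOCK
  GEM
    remote: https://gems.internal.example/
    remote: https://rubygems.org/
    specs:
      internal-billing (1.4.0)
      rake (13.0.6)
LOCK

# Newer style: one GEM section per remote, so every gem belongs to exactly one source.
NEW_FORMAT = <<~LOCK
  GEM
    remote: https://gems.internal.example/
    specs:
      internal-billing (1.4.0)

  GEM
    remote: https://rubygems.org/
    specs:
      rake (13.0.6)
LOCK

# True when any GEM section lists more than one remote.
def ambiguous_sources?(lockfile)
  lockfile.split(/\n{2,}/).any? do |section|
    section.start_with?("GEM") && section.scan(/^  remote: /).size > 1
  end
end

puts ambiguous_sources?(OLD_FORMAT)
puts ambiguous_sources?(NEW_FORMAT)
```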

However, at that time, the new format required you to upgrade. That meant Bundler updated all dependencies in the lockfile, which would require vetting each update and testing the application for regressions in behavior.

Identifying the Impact

We didn’t know how many applications were susceptible to the dependency confusion vulnerability, which made it hard to assess the impact of the problem. Our first step was to disambiguate the situation, so we could understand the problem better.

Disambiguating unknowns doesn’t need to be fancy, and it’s better to have some insight than none. In our case, we defined a cron job in our CI system to get the Bundler version information from all repositories into our data lake. It turned out that around 600 Ruby applications were susceptible to the dependency confusion vulnerability.

Having that data also allowed us to create a metric of outstanding migrations and measure progress towards solving our problem. It’s also a great way of detaching the solution from the goal, which is less constraining.
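A minimal sketch of how such a metric could be derived, assuming the lockfile contents have already been collected per repository. The repository names are hypothetical, and the threshold `2.2.22` is an assumption taken from the Bundler release mentioned later in this post; the real job and data lake schema are not shown here.

```ruby
require "rubygems" # Gem::Version ships with Ruby

# Assumption: treat anything older than this Bundler release as not yet migrated.
MIN_BUNDLER = Gem::Version.new("2.2.22")

# Reads the version recorded under "BUNDLED WITH" in a lockfile, or nil if absent.
def bundler_version(lockfile)
  version = lockfile[/^BUNDLED WITH\n\s+(\S+)/, 1]
  version && Gem::Version.new(version)
end

# Returns the names of repositories that still need migrating.
def outstanding_migrations(lockfiles_by_repo)
  pending = lockfiles_by_repo.select do |_repo, lockfile|
    version = bundler_version(lockfile)
    version.nil? || version < MIN_BUNDLER
  end
  pending.keys
end

# Hypothetical sample data standing in for what a scheduled job might collect.
lockfiles = {
  "shop-admin" => "GEM\n  specs:\n\nBUNDLED WITH\n   2.1.4\n",
  "checkout"   => "GEM\n  specs:\n\nBUNDLED WITH\n   2.2.22\n",
  "legacy-api" => "GEM\n  specs:\n",
}

p outstanding_migrations(lockfiles) # → ["shop-admin", "legacy-api"]
```

Counting the returned list over time gives a simple progress number without tying it to any particular migration technique.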

Changing Assumptions Through Experimentation

As developers, our solution has to take quite a few constraints into account. When developing software iteratively, we try to change some of those constraints and reevaluate our solution quickly. Making those changes as soon as possible surfaces unknowns, increasing the likelihood of a successful project.

In our case, having over 600 repositories to migrate meant that manually migrating every application would be too time-consuming. Requiring teams to do it themselves would be tedious and error-prone because the Gemfile.lock file couldn’t be automatically updated while keeping the current gem versions. In that case, developers would need to modify the lockfile to revert the version updates to prevent regressions from being introduced.

If we were able to update a Gemfile.lock to the new format without updating dependencies, it would enable us to automate rolling this upgrade out to all Ruby applications in Shopify. We would only rely on the application owners to deploy the changes.

We experimented with building a Bundler plugin (a gem that extends Bundler’s functionality) to automate the upgrade. It updated the Gemfile.lock file to the new format without updating dependencies. The plugin boiled down to:

Initializing the specification for a given Gemfile.lock file, which contains information about the gems such as the name, version, and remote.
Updating the Gemfile.lock file to the new lockfile format, which updates all gems in the process. We minimize updates by only permitting patch version updates.
Replacing the versions in the updated Gemfile.lock file with the gem versions from the old Gemfile.lock file.
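The last step can be sketched with plain string handling. This is not the plugin's code: it is a simplified illustration that assumes lockfile spec lines are indented by four spaces, ignores platform-specific duplicates, and uses hypothetical gem names, URLs, and helper names.

```ruby
# Matches spec lines such as "    rake (13.0.6)". Lines with other indentation
# (for example, a spec's own dependency lines) are left alone.
SPEC_LINE = /^    ([\w.\-]+) \(([^)]+)\)$/

# Builds a name => version map from the spec lines of a lockfile.
def locked_versions(lockfile)
  lockfile.scan(SPEC_LINE).to_h
end

# Rewrites spec versions in the updated lockfile back to the old versions,
# keeping the updated file's section layout.
def restore_versions(updated, old)
  pinned = locked_versions(old)
  updated.gsub(SPEC_LINE) do
    name = Regexp.last_match(1)
    version = pinned.fetch(name, Regexp.last_match(2)) # new gems keep their version
    "    #{name} (#{version})"
  end
end

old_lockfile = <<~LOCK
  GEM
    remote: https://gems.internal.example/
    remote: https://rubygems.org/
    specs:
      internal-billing (1.4.0)
      rake (13.0.3)
LOCK

updated_lockfile = <<~LOCK
  GEM
    remote: https://gems.internal.example/
    specs:
      internal-billing (1.4.2)

  GEM
    remote: https://rubygems.org/
    specs:
      rake (13.0.6)
LOCK

puts restore_versions(updated_lockfile, old_lockfile)
```

The result keeps one GEM section per remote while carrying over the versions from the old lockfile.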

This approach wasn’t a perfect solution, but it worked well enough to run Bundler migrations. It allowed us to proceed to the next problem area of migrating large numbers of applications.

Running Migrations at Scale

One of the biggest challenges in running large-scale migrations is handling edge cases. Rather than exploring how migrations can go wrong beforehand, it’s more effective to migrate a handful of applications and discover the actual problems. The other benefit is that we can identify and migrate the subset of applications with issues that have known solutions while resolving the edge cases at the same time. This approach allows us to constantly deliver on our goals and put ourselves in a better spot each day.

Once our Bundler plugin could migrate the lockfile without dependency updates, we could start migrating applications. We started out running the plugin on a handful of applications that weren’t merchant-facing. This went smoothly, and we decided to run it on a larger batch of non-critical repositories. However, we noticed issues arising from inconsistent build setups, Ruby versions, and other configurations in the larger batches of migrations.

Some of our tooling didn’t support the latest Bundler version, and we had to work with our deployment, CI, and local environments teams to update them. Our collaborations were particularly fruitful when we:

investigated the issue first
tried to solve it
shared the context with the team.

Most people want to help, and making it easy for them benefits everyone.

Some of our Docker images are built with Heroku’s Ruby buildpack, which didn’t support the required Bundler version. This situation rendered a percentage of applications unable to migrate. To solve this issue, we worked with the Heroku Buildpack team to adopt the latest Bundler version. They released a new version with the Bundler update, making it broadly available in the Ruby community.

Another critical element was raising awareness with project owners and setting a deadline to deprecate the old Bundler version. Being upfront with owners and communicating the impact of the change allowed teams to prioritize and work with us to update their projects.

The Bundler migration plugin was run locally, but scalability issues arose. It became too complicated to manage different Ruby versions, parallelize the runs, and address failures. Instead of wasting time on building a solution that would have handled all eventualities at the start, we used the migration plugin to its breaking point, investigated the problem areas, and implemented improvements.

As a response to our scaling issues, we built a command-line interface (CLI) tool on top of our CI system to set up the right environment for a repository, run commands on it, and open a pull request (PR) based on the changes made. Having an environment per repository worked great because we didn’t run into misconfiguration problems anymore. Using our CI system also allowed us to parallelize the execution, which in turn sped the process up. Furthermore, migration failures were easier to recover from and track.
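The general shape of such a runner can be sketched as follows. This is not our CLI: it uses Ruby threads as a stand-in for CI jobs, and `run_migrations`, `Result`, and the repository names are hypothetical. The point it illustrates is that each repository is processed in isolation and that failures are recorded instead of stopping the batch.

```ruby
# Outcome of one repository's migration attempt.
Result = Struct.new(:repo, :ok, :detail)

# Runs the given migration block for every repository using a small worker pool.
# A failure in one repository is captured and does not affect the others.
def run_migrations(repos, workers: 4, &migration)
  queue = Queue.new
  repos.each { |repo| queue << repo }
  queue.close # once drained, pop returns nil and workers stop

  results = Queue.new
  threads = Array.new(workers) do
    Thread.new do
      while (repo = queue.pop)
        begin
          results << Result.new(repo, true, migration.call(repo))
        rescue StandardError => e
          results << Result.new(repo, false, e.message)
        end
      end
    end
  end
  threads.each(&:join)

  Array.new(results.size) { results.pop }.sort_by(&:repo)
end

# Hypothetical migration step: one repository fails to show failure tracking.
outcomes = run_migrations(%w[checkout legacy-api shop-admin]) do |repo|
  raise "unsupported Ruby version" if repo == "legacy-api"
  "opened PR for #{repo}"
end

outcomes.each { |r| puts "#{r.repo}: #{r.ok ? 'ok' : 'failed'} (#{r.detail})" }
```

Keeping a per-repository result makes it straightforward to retry only the failed repositories after fixing their configuration.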

Preventing Future Problems

Part of iteratively solving a problem means focusing on current problems rather than future concerns. However, it doesn’t mean ignoring future concerns altogether. It’s important to distinguish between critical concerns and ones that can be figured out later on.

One example was preventing a Gemfile.lock file from regressing to its previous format, which would make us vulnerable again. We were aware of the possibility of regressions, but we also knew that we could build tooling to solve this issue. Instead of investing time in tackling the problem upfront, we decided to wait and start working on it once we migrated most applications. This approach also allowed us to gauge the magnitude of the problem rather than wasting resources working out hypotheticals.

We encountered a handful of regressions during our migration and were a bit concerned. We investigated each manually to see if there were bigger problems present. Since we didn't find anything suggesting deeper problems, we carried on and continued monitoring, knowing that if we ran into more regressions, we would have more information to change course and face the new reality.

We investigated the lockfile regression problem and shared what we learned with the Bundler team. They enhanced the tool to prevent these cases from occurring in the future. We didn’t need to implement special tooling to prevent regressions, which saved us a lot of work and time. We only had to make sure that all applications were using the correct Bundler version.

Because we staggered the migration to make continuous progress, most of our applications had been migrated to a Bundler version that didn’t yet prevent regressions. Since we had battle-tested our migration tooling and resolved most configuration issues, we were able to migrate all of our applications to the latest Bundler version in less than a day.

Rather than waiting for the perfect solution, making iterative changes improved our tooling to the point where changes that used to be hard became easy. This de-risked the deployment.

To prevent the installation of malicious gems, we made changes to our local environment tooling to ensure it always defaults to the recommended Bundler version. This ensures that an individual developer machine isn’t susceptible to running malicious code from the dependency confusion vulnerability. We also started failing CI whenever it encountered an out-of-date Bundler version, ensuring that any code change that could introduce the dependency confusion vulnerability wouldn’t be merged. Since most of our other automated processes require CI to execute, we rely on CI to catch vulnerable Bundler versions.

Sharing What We've Done with the Community

We love open source at Shopify, and we like giving back to the community. When contributing, it is quite valuable to share the purpose as well as the solution. It leads to insightful conversations that result in a better solution. Often, contributions aren’t solely PRs. Providing context on investigative work, bringing problems to someone's attention, or testing another contributor’s prototypes are just as valuable.

Our plugin worked pretty well for us, so we created a proposal in Bundler to fix the issue for the Ruby community. These changes would allow Bundler to update the Gemfile.lock file without upgrading gems in the process. Our proposal didn’t make it in, but led to a conversation resulting in an alternative approach that was shipped in Bundler 2.2.21. We helped test their approach on our applications to ensure that we caught as many edge cases as possible to help minimize the potential burden on the community.

We also ran into issues where developers using an insecure version of Bundler could accidentally revert to the old lockfile format. The problem was that the latest Bundler version (at the time) still resolved the old Gemfile.lock file on `bundle install`, which made it very simple to regress to the old format. We created a prototype to prevent that from happening, which sparked another conversation with the maintainers of Bundler and brought the issue to their attention. They released version 2.2.22 of Bundler, which prevents regressions and makes everybody in the community more secure.

We set out to fix the dependency confusion vulnerability in every Ruby project at Shopify and succeeded. This wouldn’t have been possible if we hadn’t followed an iterative approach that allowed us to make steady progress while taking shifting circumstances into account. We developed tooling that allows us to run large-scale migrations, which has come in handy for other uses. We also aggregated Bundler version data on our Ruby projects to track adoption and make future decision-making easier. Lastly, we have worked closely with the Bundler team to improve the base functionality while leveraging Shopify’s scale to find edge cases, fix bugs, and make Bundler better for everyone in the Ruby community.

Frederik is a production engineer at Shopify and part of the Ruby & Rails infrastructure team. He contributed to massively scaling Shopify’s CI/CD system and making Ruby services more secure across Shopify and the Ruby community.