Upgrading MySQL at Shopify

In early September 2021, we retired our last Shopify database virtual machine (VM) that was running Percona Server 5.7.21, marking the complete cutover to 5.7.32. In this post, I’ll share how the Database Platform team performed the most recent MySQL upgrade at Shopify. I’ll talk about some of the roadblocks we encountered during rollback testing, the internal tooling that we built out to aid upgrading and scaling our fleet in general, and our guidelines for approaching upgrades going forward, which we hope will be useful for the rest of the community.

Why Upgrade and Why Now?

We were particularly interested in upgrading due to the replication improvements that would preserve replication parallelism in a multi-tier replication hierarchy via transaction writesets. More generally, though, upgrading our version of MySQL had been on our minds for a while, and the reasons have only become more important over time as we’ve grown:

  • We’ve transferred more load to our replicas over time, and without replication improvements, high load could cause replication lag and a poor merchant and buyer experience.
  • Due to our increasing global footprint, to maintain efficiency, our replication topology can be up to four “hops” deep, which increases the importance of our replication performance.
  • Without replication improvements, in times of high load such as Black Friday/Cyber Monday (BFCM) and flash sales, there’s a greater likelihood of replication lag that in turn heightens the risk to merchants’ data availability in the event of a writer failure.
  • It’s industry best practice to stay current with all software dependencies to receive security and stability patches.
  • We expect to eventually upgrade to MySQL 8.0. Building the upgrade tooling required for this minor upgrade helps us prepare for that.

To the last point, one thing we definitely wanted to achieve as a part of this upgrade was—to put it in the words of my colleague Akshay—“Make MySQL upgrades at Shopify a checklist of tasks going forward, as opposed to a full-fledged project.” Ideally, by the end of the project, we’d have step-by-step documentation for performing an upgrade that anyone on the Database Platform team can follow, and the whole process would take on the order of weeks, rather than months, to complete.

Database Infrastructure at Shopify

Core

Shopify’s Core database infrastructure is horizontally sharded by shop and spread across hundreds of shards, each consisting of a writer and five or more replicas. The shards run on Google Compute Engine virtual machines (VMs) using the Percona Server fork of MySQL, and our backup system makes use of Google Cloud’s persistent disk snapshots. While we run the upstream releases of Percona Server, we maintain an internal fork and build pipeline that allows us to patch it as necessary.

Mason

Without automation, there’s a non-trivial amount of toil involved in just the day-to-day operation of our VM fleet due to its sheer size. VMs can go down for many reasons, including failed GCP live migrations, zone outages, or just run-of-the-mill VM failures. Mason was developed to respond to a VM going down by spinning up a replacement, a task far better suited to a robot than to a human, especially in the middle of the night.

Mason started as a self-healing service for our VM-based databases, born out of a Shopify Hack Days project in late 2019.

Healing Isn’t All That’s Needed

Shopify’s query workload can differ vastly from shard to shard, which means we have to maintain very different configurations. Our minimal configuration is six instances: three in Google Cloud’s us-east1 region and three in us-central1. However, each shard’s configuration can differ in other ways:

  • There may be additional replicas to accommodate higher read workloads or to provide replicas in other locations globally.
  • The VMs for the replicas may have a different number of cores or memory to accommodate differing workloads.

With all of this in mind, you can see why it would be desirable to have automation built around maintaining these differences: without it, a good chunk of the manual toil involved in on-call tasks would simply be provisioning VMs, which isn’t an enviable set of responsibilities.

Using Mason to Upgrade MySQL

Upgrades at our scale are extremely high effort, as our VM fleet currently numbers in the thousands. We decided that building additional functionality onto Mason would be the way forward to automate our MySQL upgrade, and called this work the Declarative Database Topologies project. Where Mason was previously a purely reactive tool that only maintained a hardcoded default configuration, we envisioned its next iteration as a proactive tool: one that allows us to define a per-shard topology and do the provisioning work to reconcile the current state with the desired state. Doing this would allow us to automate the provisioning of upgraded VMs, removing much of the toil involved in upgrading a large fleet, and to automate scale-up provisioning for events such as BFCM and other high-traffic occurrences.
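
To make the idea concrete, here’s a minimal sketch of what a declarative reconcile loop looks like. Everything in it (the spec format, field names, and the current_replica_count and provision_vm helpers) is hypothetical and illustrative, not Mason’s actual interface:

    # Illustrative sketch of a declarative reconcile loop; the spec format and
    # helper functions are hypothetical stand-ins, not Mason's real code.
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ReplicaSpec:
        region: str
        count: int
        machine_type: str

    # Desired state, declared per shard (example values only).
    DESIRED = {
        "shard-042": [
            ReplicaSpec("us-east1", 3, "n1-highmem-16"),
            ReplicaSpec("us-central1", 3, "n1-highmem-16"),
        ],
    }

    def current_replica_count(shard, region):
        """Stub: in reality this would query inventory/service discovery."""
        return 2  # pretend one replica is missing in this region

    def provision_vm(shard, spec, dry_run=True):
        """Stub: in reality this would drive the cloud API and config management."""
        action = "DRY RUN: would provision" if dry_run else "Provisioning"
        print(f"{action} {spec.machine_type} for {shard} in {spec.region}")

    def reconcile(dry_run=True):
        # Diff desired state against current state and provision the difference.
        for shard, specs in DESIRED.items():
            for spec in specs:
                missing = spec.count - current_replica_count(shard, spec.region)
                for _ in range(max(missing, 0)):
                    provision_vm(shard, spec, dry_run=dry_run)

    reconcile()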

The Project Plan

We had approximately eight months before BFCM preparations would begin in which to achieve the following:

  • pick a new version of MySQL
  • benchmark and test the new version for any regressions or bugs
  • perform rollback testing and create a rollback plan so we can safely downgrade if necessary
  • finally, perform the actual upgrade

At the same time, we also needed to evolve Mason to:

  • increase its stability
  • move from a global hardcoded configuration to a dynamic per-shard configuration
  • have it respond to scale-ups when the configuration changed
  • have it care about Chef configuration, too
  • … do all of that safely.

One of the first things we had to do was pick a version of Percona Server. We wanted to maximize the gains that we would get from an upgrade while minimizing our risk. This led us to choose the highest minor version of Percona Server 5.7, which was 5.7.32 at the start of the project. By doing so, we benefited from the bug and security fixes made since we last upgraded; in the words of one of our directors, “incidents that never happened” because we upgraded. At the same time, we avoided some of the larger risks associated with major version upgrades.

Once we had settled on a version, we made changes in Chef to have it handle an in-place upgrade. Essentially, we created a new Chef role with the existing provisioning code but with the new version specified for the MySQL server version variable and modified the code so that the following happens:

  1. Restore a backup taken from a 5.7.21 VM onto a VM with 5.7.32 installed.
  2. Allow the VM and MySQL server process to start up normally. 
  3. Check the contents of the mysql_upgrade_info file in the data directory. If the version recorded there differs from the installed MySQL server version, run mysql_upgrade (via a wrapper script that accounts for the unexpected behaviour of mysql_upgrade exiting with return code 2, instead of the typical 0, when an upgrade wasn’t required); a sketch of this logic follows the list.
  4. Perform the necessary replication configuration and proceed with the rest of the MySQL server startup.
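
A simplified sketch of the check in step 3 follows, with a hypothetical data directory path and installed-version lookup (this isn’t our actual Chef or wrapper code):

    # Simplified sketch of the version check and mysql_upgrade wrapper; the
    # data directory path and installed-version lookup are illustrative only.
    import subprocess
    from pathlib import Path

    DATA_DIR = Path("/var/lib/mysql")  # hypothetical data directory
    INSTALLED_VERSION = "5.7.32"       # would normally be read from the installed server

    def recorded_version():
        """Version written to mysql_upgrade_info by the last mysql_upgrade run."""
        info = DATA_DIR / "mysql_upgrade_info"
        return info.read_text().strip() if info.exists() else ""

    def run_mysql_upgrade():
        # mysql_upgrade exits 0 on success and 2 when no upgrade was required;
        # the wrapper treats both as success.
        result = subprocess.run(["mysql_upgrade"])
        if result.returncode not in (0, 2):
            raise RuntimeError(f"mysql_upgrade failed with code {result.returncode}")

    if not recorded_version().startswith(INSTALLED_VERSION):
        run_mysql_upgrade()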

After this work was completed, all we had to do to provision an upgraded version was to specify that the new VM be built with the new Chef role.

Preparing for the Upgrade

Performing the upgrade is the easy part, operationally. You can spin up an instance with a backup from the old version, let mysql_upgrade do its thing, have it join the existing replication topology, optionally take backups from this instance with the newer version, populate the rest of the topology, and then perform a takeover. Making sure the newer version performs the way we expect and can be safely rolled back to the old version, however, is the tricky part.
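
For reference, the operational flow just described boils down to something like the following sketch, in which every helper is a stub standing in for real backup, provisioning, and failover tooling (none of it is actual Shopify code):

    # Stub sketch of the per-shard flow described above; each helper stands in
    # for real tooling and does nothing here.
    def restore_backup_onto_new_vm(shard, version):
        return f"{shard}-replica-{version}"  # pretend this is the new VM's hostname

    def wait_for_replication_catchup(host): ...
    def take_backup(host): ...
    def provision_remaining_replicas(shard, version): ...
    def perform_takeover(host): ...

    def upgrade_shard(shard, new_version="5.7.32"):
        # 1. Bring up one replica on the new version from an old-version backup
        #    and let mysql_upgrade run as part of startup.
        host = restore_backup_onto_new_vm(shard, new_version)
        wait_for_replication_catchup(host)
        # 2. Optionally take fresh backups from the upgraded replica, then
        #    rebuild the rest of the topology from them.
        take_backup(host)
        provision_remaining_replicas(shard, new_version)
        # 3. Finally, perform a takeover so the writer runs the new version.
        perform_takeover(host)

    upgrade_shard("shard-042")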

During our benchmarking tests, we didn’t find anything anomalous, performance-wise. However, when testing the downgrade from 5.7.32 back to 5.7.21, we found that the MySQL server wouldn’t start up properly: the error log complained about the InnoDB persistent statistics tables (mysql.innodb_index_stats and mysql.innodb_table_stats), and the server set about recalculating transient statistics for every table at startup.

When we allowed the calculation of transient stats at startup to run to completion, it took over a day due to a lengthy table analyze process on some of our shards—not great if we needed to roll back more urgently than that.

A cursory look at the Percona Server source code revealed that the table_name column in the innodb_index_stats and innodb_table_stats tables changed from VARCHAR(64) in 5.7.21 to VARCHAR(199) in 5.7.32. We patched mysql_system_tables_fix.sql in our internal Percona Server fork so that the column lengths were set back to the values that 5.7.21 expected, and re-tested the rollback. This time we didn’t see the errors about the column lengths; however, we still saw the analyze table process causing full table rebuilds, again leading to an unacceptable startup time. It became clear that, by fixing these column lengths, we had merely addressed a symptom of the underlying problem.

At this point, while investigating our options, it occurred to us that one reason this analyze table process might be happening is that we run ALTER TABLE commands as part of MySQL server startup: a startup script sets the AUTO_INCREMENT value on tables to enforce a minimum value (necessary because the auto_increment counter isn’t persisted across restarts, a long-standing bug that is addressed in MySQL 8.0).
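
To give a sense of what that startup task does, here’s a hypothetical sketch using pymysql; the connection details and table and column names are made up, and this isn’t Shopify’s actual script:

    # Hypothetical sketch of a startup task that restores an AUTO_INCREMENT floor;
    # connection details and table/column names are made up.
    import pymysql

    conn = pymysql.connect(host="127.0.0.1", user="root", password="", database="shop")
    with conn.cursor() as cur:
        cur.execute("SELECT COALESCE(MAX(id), 0) FROM orders")
        (max_id,) = cur.fetchone()
        # Setting AUTO_INCREMENT should be an instantaneous, metadata-only ALTER.
        # The bug described in the next section caused exactly this kind of
        # statement to trigger a full table rebuild after a downgrade.
        cur.execute("ALTER TABLE orders AUTO_INCREMENT = %d" % (max_id + 1))
    conn.close()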

Investigating the Bug

Once we had our hypothesis, we started to test it. This culminated in a group debugging session where a few members of our team found that the following steps reproduced the bug that resulted in the full table rebuild:

  1. On 5.7.32: A backup previously taken from 5.7.21 is restored.
  2. On 5.7.32: An ALTER TABLE that should be an instantaneous, metadata-only change is run on a table, for example, ALTER TABLE t AUTO_INCREMENT=n. The table is changed instantaneously, as expected.
  3. On 5.7.32: A backup is taken.
  4. On 5.7.21: The backup taken from 5.7.32 in the previous step is restored.
  5. On 5.7.21: The MySQL server is started up, and mysql_upgrade performs the in-place downgrade.
  6. On 5.7.21: An ALTER TABLE statement similar to the one in step 2 is run. This time a full rebuild of the table is performed, unexpectedly and unnecessarily.

Stepping through the above steps with the GNU Debugger (GDB), we found the place in the MySQL server source code where the server incorrectly concludes that the indexes have changed in a way that requires a table rebuild (the has_index_def_changed function in sql/sql_table.cc of Percona Server 5.7.21).
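
The relevant check, lightly paraphrased here from the upstream 5.7 source (so the exact code in our build may differ slightly), compares the old and new key definitions:

    /* Paraphrased from has_index_def_changed() in sql/sql_table.cc. If any of
       these comparisons differ between the old and new table definitions, the
       server concludes the index has changed and the ALTER can't be applied as
       a simple metadata change. */
    if ((table_key->algorithm != new_key->algorithm) ||
        ((table_key->flags & HA_KEYFLAG_MASK) !=
         (new_key->flags & HA_KEYFLAG_MASK)) ||
        (table_key->user_defined_key_parts != new_key->user_defined_key_parts))
      return true;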

Inspecting in GDB, we saw that the flags for the old version of the table (table_key->flags above) didn’t match those of the new version of the table (new_key->flags above), despite the fact that only a metadata change had been applied.

Digging deeper, we found past attempts to fix this bug. In the 5.7.23 release notes, there’s the following:

“For attempts to increase the length of a VARCHAR column of an InnoDB table using ALTER TABLE with the INPLACE algorithm, the attempt failed if the column was indexed. If an index size exceeded the InnoDB limit of 767 bytes for COMPACT or REDUNDANT row format, CREATE TABLE and ALTER TABLE did not report an error (in strict SQL mode) or a warning (in nonstrict mode). (Bug #26848813)”

A fix was merged for the bug; however, we saw that there was a second attempt to fix this behaviour. In the 5.7.27 release notes, we see:

“For InnoDB tables that contained an index on a VARCHAR column and were created prior to MySQL 5.7.23, some simple ALTER TABLE statements that should have been done in place were performed with a table rebuild after an upgrade to MySQL 5.7.23 or higher. (Bug #29375764, Bug #94383)”

A fix was merged for this bug as well, but it didn’t fully address the issue: some ALTER TABLE statements that should have been simple metadata changes still led to a full table rebuild.

My colleague Akshay filed a bug against this behaviour; however, the patch we included ultimately wasn’t accepted by the MySQL team. In order to safely upgrade past this bug, we still needed MySQL to behave in a reasonable way on downgrade, so we ended up patching Percona Server in our internal fork. We tested our patched version successfully in our final rollback tests, unblocking our upgrade.

What are “Packed Keys” Anyway?

The PACK_KEYS feature of the MyISAM storage engine allows keys to be compressed, making indexes much smaller and improving performance. The feature isn’t supported by the InnoDB storage engine, whose index layout and expectations are completely different. In MyISAM, when an indexed VARCHAR column is expanded past eight bytes, converting its keys from unpacked to packed, the change (rightfully) triggers an index rebuild.

However, before the first attempt to fix the bug in 5.7.23, the same type of change triggered the same behaviour in InnoDB, even though packed keys aren’t supported there. To remedy this, from 5.7.23 onwards, the HA_PACK_KEY and HA_BINARY_PACK_KEY flags are no longer set if the storage engine doesn’t support packed keys.

That, however, meant that for tables created prior to 5.7.23, these flags were unexpectedly set even on storage engines that didn’t support them. So, upon upgrade to 5.7.23 or higher, any metadata-only ALTER TABLE command executed on such an InnoDB table led the server to incorrectly conclude that a full index rebuild was necessary. This brings us to the second attempt to fix the issue, in which the flags were removed entirely if the storage engine didn’t support them. Unfortunately, that second fix didn’t account for the case where the flags differ between the old and new table definitions but the difference should be ignored when deciding whether the indexes need to be rebuilt, and that’s what we addressed in our proposed patch: during a downgrade, if the old version of the table (from 5.7.32) doesn’t specify the flags but the new version (in 5.7.21) does, we bypass the index rebuild.

Meanwhile, in the Mason Project… 

While all of this rollback testing work was in progress, another part of the team was hard at work shipping new features in Mason to let it handle the upgrades. These were some of the requirements we had that guided the project work:

  • The creation of a “priority” lane—self-healing should always take precedence over a scale-up related provisioning request.
  • We needed to throttle the scale-up provisioning queue to limit how much work was done simultaneously.
  • Feature flags were required to limit which shards the scale-up feature was released to, so that we could control which shards were provisioned and roll out the new functionality carefully.
  • A dry-run mode for scale-up provisioning was necessary to allow us to test these features without immediately making changes to the production systems (a rough sketch of how these pieces fit together follows this list).
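
As a hypothetical illustration of how these pieces might fit together (none of this is Mason’s actual code), self-healing work can be drained ahead of scale-up work, with scale-up throttled by a semaphore, gated behind a per-shard flag, and defaulting to dry-run:

    # Hypothetical sketch: self-healing requests always preempt scale-up requests,
    # scale-up work is throttled, gated by a per-shard flag, and dry-run by default.
    import queue
    import threading
    import time

    healing_q = queue.Queue()
    scale_up_q = queue.Queue()
    scale_up_limit = threading.Semaphore(2)  # throttle: at most 2 concurrent scale-ups
    SCALE_UP_ENABLED_SHARDS = {"shard-001"}  # feature-flag stand-in

    def provision(shard, reason, dry_run):
        prefix = "DRY RUN: would provision" if dry_run else "Provisioning"
        print(f"{prefix} {shard} ({reason})")

    def worker():
        while True:
            try:
                # Priority lane: always drain self-healing work first.
                provision(healing_q.get_nowait(), "self-healing", dry_run=False)
                continue
            except queue.Empty:
                pass
            try:
                shard = scale_up_q.get(timeout=1)
            except queue.Empty:
                continue
            if shard not in SCALE_UP_ENABLED_SHARDS:
                continue  # feature flag off for this shard: skip it
            with scale_up_limit:  # throttle concurrent scale-up provisioning
                provision(shard, "scale-up", dry_run=True)

    healing_q.put("shard-007")
    scale_up_q.put("shard-001")
    for _ in range(4):
        threading.Thread(target=worker, daemon=True).start()
    time.sleep(2)  # let the workers drain the queues before the script exits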

Underlying all of this was an abundance of caution in shipping the new features. Because of our large fleet size, we didn’t want to risk provisioning a lot of VMs we didn’t need, or VMs in the wrong configuration, which would have cost us either way, whether in GCP resource usage or in the engineering time spent decommissioning them.

In the initial stages of the project, stabilizing the service was important since it played a critical role in maintaining our MySQL topology; over time it had grown into a component of our infrastructure that significantly improved our on-call quality of life. Some of the early tasks were simply making it a first-class citizen among the services we own: we stabilized the staging environment it’s deployed into, created new monitoring and improved what already existed, and had it emit metrics to Datadog indicating when the topology was underprovisioned (that is, when Mason had failed to do its job).

Another challenge was that Mason itself talks to many disparate components in our infrastructure: the GCP API, Chef, the Kubernetes API, ZooKeeper, Orchestrator, and the database VMs themselves. It was often hard to anticipate failure scenarios; frequently, the failure we experienced was completely new and wouldn’t have been caught by existing tests. This is still an ongoing challenge, one that we hope to address through improved integration testing.

Later on, as we onboarded new people to the project and started introducing more features, it also became obvious that the application was quite brittle in its current state; adding new features became more and more difficult due to the existing complexity, especially when they were being worked on concurrently. It brought to the forefront the importance of breaking down streams of work that have the potential to become hard blockers, and highlighted how much a well-designed codebase can decrease the chances of this happening.

We faced many challenges, but ultimately shipped the project on time. Now that the project is complete, we’re dedicating time to improving the codebase so it’s more maintainable and developer-friendly.

The Upgrade Itself

By the time rollback testing was underway, we had already been running 5.7.32 for a few months on several shards reserved for canary testing. A few of those shards are load tested on a regular basis, so we were reasonably confident that this, along with our own benchmarking tests, meant the new version was ready for our production workload.

Next, we created a rollback plan in case the new version was unstable in production for unforeseen reasons. One of the early suggestions for risk mitigation was to maintain a 5.7.21 VM per shard and continue to take backups from them. However, that would have been operationally complex and would have required more tooling and monitoring to make sure we always had a running 5.7.21 VM for every shard (rather toilsome when the number of shards runs into the hundreds). Ultimately, we decided against this plan, especially considering that we were confident we could roll back to our patched build of Percona Server if we had to.

Our intention was to do everything we could to de-risk the upgrade by performing extensive rollback testing, but ultimately we preferred to fix forward whenever possible. That is, the option to roll back was expected to be taken only as a last resort.

We started provisioning new VMs with 5.7.32 in earnest on August 25th using Mason, after our tooling and rollback plan were in place. We decided to stagger the upgrades by creating several batches of shards. This allowed the upgraded shards to “bake” and not endanger the entire fleet in the event of an unforeseen issue. We also didn’t want to provision all the new VMs at once due to the amount of resource churn (at the petabyte scale) and the pressure it would put on Google Cloud.

On September 7th, the final shards were completed, marking the end of the upgrade project.

What Did We Take Away from This Upgrade? 

This upgrade project highlighted the importance of rollback testing. Without the extensive testing that we performed, we would have never known that there was a critical bug blocking a potential rollback. Even though needing to rebuild the fleet with the old version to downgrade would have been toilsome and undesirable, patching 5.7.21 gave us the confidence to proceed with the upgrade, knowing that we had the option to safely downgrade if it became necessary.

Mason, the tooling we relied on, also became more important over time. In the past, Mason was considered a lower-tier application, and simply turning it off was a band-aid solution when it behaved in unexpected ways; fixing it often wasn’t a priority when bugs were encountered. As time has gone by, though, we’ve recognized how large a role it plays in mitigating toil and maintaining healthy on-call expectations, especially as the size of our fleet has grown. We’ve invested more time and resources into it by improving test coverage and refactoring key parts of the codebase to reduce complexity and improve readability, and we plan to further improve the local development environment and streamline its deployment pipeline.

Finally, investing in documentation and the easy repeatability of upgrades has been a big win for Shopify and for our team. When we first started planning this upgrade, finding out how upgrades were done in the past was a bit of a scavenger hunt that required a lot of institutional knowledge. By developing guidelines and documentation, we paved the way for future upgrades to be done faster, more safely, and more efficiently. Rather than repeating an intense, manual context-gathering process that pays no future dividends, we can now treat a MySQL upgrade as simply a series of guidelines to follow using our existing tooling.

Next up: MySQL 8!

Yi Qing Sim is a Senior Production Engineer and brings nearly a decade of software development and site reliability engineering experience to the Database Backend team, where she primarily works on Shopify’s core database infrastructure.

