Resiliency is important at Shopify and although we have many tools to help us during incidents, some problems need to be fixed at the database level. At Shopify, we use ProxySQL to manage connections between applications and our MySQL databases. ProxySQL accepts connections from applications and distributes traffic across MySQL databases, improving performance.
ProxySQL comes with a powerful feature called query rules. The main use of these rules at Shopify is to reroute, rewrite, or reject queries matching a specified regex. However, with great power comes great responsibility. These rules are powerful and can have unexpected consequences if used incorrectly. At Shopify’s scale, we’re running thousands of ProxySQL instances, so applying query rules to each one is a painful and time consuming process, especially during an incident. We’ve built a tool to help us address these challenges and make deploying new rules safe and scalable.
Types of Query Rules
Databases can be overloaded by high traffic, faulty code, or malicious spam. With ProxySQL’s query rules, you can respond to these issues quickly and effectively by rerouting, rewriting, or rejecting queries.
Reroute rules allow us to send matched queries to a certain hostgroup: a pool of interchangeable servers that look like a single data source from the application’s point of view. For example, our read replicas are all in one hostgroup, so if you want certain queries to go to read replicas, you create a query rule to route them there.
Recently, reroute rules were very vital for us at Shopify when a team created query rules that sent certain queries to read replicas to investigate scaling bottlenecks. After measuring the performance impact of this, the team made a code change so the queries that proved to perform better on read replicas were permanently sent there. Without query rules, this process would be more time consuming since iterating through code changes takes much longer than using query rules.
Rewrite query rules are helpful in incidents if queries need to be modified and deploying a code change takes too long. For example, if a database column’s name gets changed, but your code still references the old column name, you would experience errors. While waiting for a code change to reference the correct column name to be deployed, a temporary rewrite rule can be used in the meantime. This is an example of what inserting a rewrite rule would look like:
INSERT INTO mysql_query_rules
(rule_id, active, match_pattern, replace_pattern) V
ALUES (123, 1, "users\.name", "users.first_name");
Query rules for rejecting queries are useful if we need to protect our databases from malicious queries or a misbehaving code change.
Dangers of Query Rules
ProxySQL’s query rules are very flexible and powerful, but can cause serious issues if used incorrectly. For example, you could create a rule that rejects more queries than you expected or rewrites queries differently than expected leading to data corruption and downtime.
Dry Running Query Rules
These problems could be avoided if you could safely dry run query rules. For example, to rewrite queries matching a certain regex, you create a rule in dry run mode and preview the result. The rule wouldn’t actually perform its action, it would simply log matched queries and the action that would have been taken. Once an author is confident that their rule is working correctly, they change the rule from dry run mode to active mode, where the rule actually takes action.
ProxySQL lacks a dry run feature, so we decided to add it to our fork of ProxySQL. We’re looking to contribute it upstream so that the ProxySQL community will be able to dry run their query rules. You can find a pull request with the changes we made to our fork in ProxySQL's repo.
ProxySQL stores configuration and statistics in an internal SQLite database and the query rules live in the mysql_query_rules table. We use the
log fields of ProxySQL’s mysql_query_rules table to define what mode a query rule is in.
With this patch to our fork of ProxySQL, we now run query rules in dry run mode, and see what queries are affected and their results. Once we’re comfortable with the rule, we simply change it to active mode and it does its operations. If a rule isn’t acting like we expected it to, we can fix the rule without having to worry about it doing anything dangerous.
Preventing Logging Degradation
A question to our approach might be what if all this logging causes ProxySQL’s performance to degrade? In most cases this doesn’t happen, but we considered the case of someone writing a rule that had too broad of a matching regex pattern that causes thousands of queries per second to be logged. As part of the modification to our ProxySQL fork, we had an exponential backoff to these logs. So only the 1st, 2nd, 4th, 8th, and so on matched queries get logged, helping us mitigate any issues that come from a really broad query rule logging too much.
Dynamic Query Rules
Now that you know about how we use query rules and mitigate the dangers of query rules, we’ll take a look at how we built a tool to apply these rules to thousands of ProxySQL instances.
As mentioned above, adding query rules to ProxySQL is done using
INSERT statements. Similarly, modifying and deleting query rules is done by using
DELETE SQL statements. Since Shopify uses thousands of ProxySQL replicas, if someone created a rule, they have to do it in each replica. And then, if they want to edit or delete the rule, they have to do the same thing. This is ok if you only need to do this on a couple replicas, but doing so for thousands of replicas isn’t scalable.
We built a tool to dynamically create, modify, and delete query rules. Developers enter parameters for their rule via a web app. The app then updates our global key-value store. Shopify’s key-value store is used by many applications for control plane use-cases. We run a sidecar container alongside ProxySQLs that polls our key-value store for changes to our ruleset and applies necessary changes to ProxySQL. The sidecar makes the necessary changes if a rule is added, edited, or deleted.
Query Rules in Distributed Systems
In a perfect world, the query rules system works fine with no issues. However, in distributed systems many things could go wrong. For example, the key-value store goes down or communication between the sidecar and ProxySQL containers fails. When we designed our system we took all of these failure modes into account. You can read more about failure modes and how resiliency planning is done at Shopify in the post Resiliency Planning for High-Traffic Events.
We designed the system so that the sidecar container caches a list of rules. When polling the key-value store, the sidecar compares its list with the key-value store’s list. The key-value store is the source of truth, so the sidecar makes any changes needed to stay in sync with the key-value store. The sidecar then updates ProxySQL’s list of rules by modifying ProxySQL’s mysql_query_rules table using SQL statements.
If the key-value store ever went down, the sidecar continues operating normally, it just wouldn’t pick up any new changes to the list of rules. Once the key-value store is back up, the sidecar picks up any changes.
If communication between the sidecar and ProxySQL containers fails or they go out of sync for whatever reason, we wipe both ProxySQL’s list of rules as well as the sidecar’s list of rules. In the next polling loop, the sidecar fetches the rules from the key-value store and adds the rules to ProxySQL. Although rules are removed for a brief amount of time, we favour consistency of the list of rules across ProxySQLs rather than availability of the rules on each shard. We rather have a rule get removed for a brief amount of time than having split-brain issues.
Another mechanism we used to prevent issues if the key-value store went down was to add a time-to-live (TTL) to rules. Developers optionally specify how long a rule lasts before it automatically gets removed. Even if the key-value store went down, the sidecar still removes the rule once it’s TTL expires.
There are rules that need to be 100% available. In these cases, we create the rules on startup in a proxysql.cnf file. Rules created on startup are static, whereas rules created by our tool are dynamic.
Although this tool is most often used as a last resort, it is very valuable for experimentation and is extremely useful in cases where it is needed. Creating rules quickly and safely at Shopify’s scale would not be possible without this project. Hopefully this post taught you about how useful ProxySQL query rules can be and gives you an idea of how to implement a similar system.
Rahul Rangith is a Production Engineer Intern on the Database Connection Management team. He is currently a student in his final year of Computer Engineering at the University of Waterloo.
If building systems from the ground up to solve real-world problems interests you, our Engineering blog has stories about other challenges we have encountered. Visit our Engineering career page to find out about our open positions. Join our remote team and work (almost) anywhere. Learn about how we’re hiring to design the future together - a future that is digital by default.