An Introduction to DNS Traffic Management

Distributed systems are only as resilient as we build them to be. Domain Name System (DNS) traffic management is a widely used way to build in that resilience. In this first part of a two-part series, we give a broad overview of DNS and how it’s used for traffic management, as well as the different reasons to use DNS traffic management.

If you already know what DNS is, what traffic management is, and why you would need DNS traffic management, you can skip directly to Part 2: Shopify’s DNS Traffic Management, where we share our journey and the improvements we made to DNS traffic management at Shopify.

A Summarized History of DNS

Everything started with humans trying to communicate and a plain text file, even before the advent of the modern internet.

The Advanced Research Projects Agency Network (ARPANET) was conceived in 1966 to enable access to remote computers. In 1969, the first computers were connected to ARPANET, followed by the implementation of the Network Control Program (NCP) in 1970. Work on the Transmission Control Protocol (TCP) started in 1974. As it evolved, and guided by the need to connect more and more computers together, TCP/IP was created in the late 1970s to join separate networks into a network of networks. It replaced NCP in ARPANET on January 1st, 1983.

At the beginning of ARPANET, only a handful of computers from four different universities were connected together, which made it easy for people to remember the addresses. This became challenging as new computers joined the network. The Stanford Research Institute provided, through file sharing, a manually maintained file containing the hostnames and related addresses of hosts as provided by member organizations of ARPANET. This file, originally named HOSTS.TXT, is now also broadly known as the /etc/hosts file on Unix and Unix-like systems.

A growing network with an increasingly large number of computers meant an increasingly large file to download and maintain. By the early 1980s, this process had become slow, and an automated naming system was required to address the technical and personnel issues of the existing approach. The Domain Name System (DNS) was born: a protocol converting human-readable (and memorable) domain names into Internet Protocol (IP) addresses.

What is DNS?

Let’s picture DNS as a very large library where domains are organized from the least to the most meaningful parts of their names. For instance, if you (the client) want to resolve shops.myshopify.com, you would consider that .com is the least meaningful part, as it’s shared with many domain names, and shops the most meaningful part, as it specifies the subdomain you’re requesting. Finding shops.myshopify.com in this DNS Library would thus mean going to the .com shelves and finding the myshopify book. With the book in hand, we would then open it to the shops page and see something that looks like the following:

DNS Library Book

The image tells us which IP address the hostname corresponds to. Our DNS Library also provides us with the equivalent of a Due Date, called Time To Live (TTL). It corresponds to the amount of time for which the association of hostname to IP address is valid. We remember, or cache, that information for the given amount of time. If we need this information after it expires, we have to “find that book” again to verify whether the association is still valid.

The opposite concept also exists: if you’re trying to find a page in the book and can’t find it, chances are that you won’t wait there until someone writes it down for you. In DNS, this concept is driven by the Negative TTL, which represents how long an NXDOMAIN (non-existing domain) answer can be cached. This means that the author of a new page in this book can’t assume their update is known by everyone until that period of time has elapsed.
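To make the Due Date idea concrete, here is a minimal sketch of a cache that honors both TTL and Negative TTL. It isn’t how any particular resolver is implemented: the `DnsCache` class, its methods, and the example.com hostnames and 192.0.2.x address are all illustrative placeholders.

```python
import time


class DnsCache:
    """Toy resolver cache: remembers answers (and misses) until they expire."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # hostname -> (answer or None, expiry time)

    def put(self, hostname, answer, ttl):
        # answer=None records a negative (NXDOMAIN) entry; ttl is then
        # the Negative TTL.
        self._entries[hostname] = (answer, self._clock() + ttl)

    def get(self, hostname):
        """Return (hit, answer). hit is False when we must "find the book" again."""
        entry = self._entries.get(hostname)
        if entry is None:
            return False, None
        answer, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[hostname]  # expired: forget it and ask again
            return False, None
        return True, answer


cache = DnsCache()
cache.put("shops.example.com", "192.0.2.10", ttl=15)  # valid for 15 seconds
cache.put("missing.example.com", None, ttl=60)        # cached NXDOMAIN
```

A hit on `missing.example.com` during those 60 seconds returns `(True, None)`, which is why a freshly created record can stay invisible to some clients until the Negative TTL has elapsed.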

Another relevant element is that the DNS Library doesn't necessarily hold only one book but multiple identical ones, from different editors, enabling others to be consulted if one copy is unavailable.

In DNS terms, the editors are DNS providers. The shelves contain multiple sections for each domain nameserver, the servers that provide DNS resolution as a source of truth for a domain. The books are zones in the domain nameservers, and the book pages are DNS records, the direct relation between the queried record and the value it should resolve to. The DNS Library is what we call root servers, a set of 13 nameservers (named from a to m) that hold the keys to the root of the hostnames. The root servers are responsible for helping to locate the shelves, the nameservers of the Top Level Domains (TLD), the domains at the highest level in the hierarchy for DNS.
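The walk from the library entrance to the right page can be sketched with a few dictionaries. This is a simplification under stated assumptions: the data is hard-coded and hypothetical (example.com, a 192.0.2.x address, made-up server names), and only three-label hostnames are handled, whereas real resolution involves network queries and delegation.

```python
# Hypothetical, hard-coded data standing in for real nameservers.
ROOT = {"com": "tld-com"}                                # root servers: who serves each TLD?
TLD_SERVERS = {"tld-com": {"example": "ns-example"}}     # TLD nameservers: who serves each domain?
ZONES = {"ns-example": {"shops": ("192.0.2.10", 300)}}   # zone: record name -> (address, TTL)


def resolve(hostname):
    """Walk from the least to the most specific label, as in the library analogy."""
    record, domain, tld = hostname.split(".")  # assumes exactly three labels
    try:
        tld_server = ROOT[tld]                        # find the shelves
        nameserver = TLD_SERVERS[tld_server][domain]  # find the book
        return ZONES[nameserver][record]              # open it to the page
    except KeyError:
        return None  # stands in for an NXDOMAIN answer


print(resolve("shops.example.com"))  # ('192.0.2.10', 300)
```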

What is Traffic Management?

Traffic management is a key branch of logistics that aims to plan and control everything required to provide for the safe, orderly, and efficient movement of persons and goods. Traffic management helps to manage situations such as congestion or roadblocks, by redirecting traffic or sharing traffic between multiple routes. For instance, some navigation applications use data they get from their users (current location, current speed, etc.) to know where congestion is happening and improve the situation by suggesting alternative routes instead of sending them to the already overloaded roads.

A more generic description is that traffic management uses data to decide where to direct the traffic. We could have different paths depending on the country of origin (think country border waiting lines for the booths, where the checks are different depending on the passport you hold), different paths depending on vehicle size (bike lanes, directions for trucks vs. cars, etc.) or any other information we find relevant.

DNS + Traffic Management = DNS Traffic Management

Bringing the concept of traffic management to DNS means serving data-driven answers to DNS queries resulting in different answers depending on the location of the requester or for each request. For instance, we could have two clusters of servers and want to split the traffic between the two: we can decide to answer 50% of the requests with the first cluster and the other 50% with the second. The clients obtaining the answers would connect to the cluster they got directed to, without any other action on their part.
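As a rough illustration of the 50/50 split, a nameserver could pick which cluster to answer with on each query using a weighted random choice. The `pick_cluster` function and the two addresses below are hypothetical, not taken from any specific DNS provider.

```python
import random


def pick_cluster(weights, rng=random):
    """Pick one cluster address per query, proportionally to its weight."""
    clusters = list(weights)
    return rng.choices(clusters, weights=[weights[c] for c in clusters], k=1)[0]


# Two placeholder clusters, each receiving about half of the answers.
weights = {"192.0.2.10": 50, "198.51.100.10": 50}
answer = pick_cluster(weights)
```

The split is statistical: over many queries each cluster gets close to its share, but any single client simply follows the one answer it received.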

However, as we saw in the previous section, DNS answers are cached to avoid overloading servers with queries. Each time an answer is cached by a resolver, that resolver won’t repeat the query for the duration of the TTL. Using a low TTL makes sure that the information is kept around, but not for too long. For example, returning a TTL of 15 seconds means that after 15 seconds the client needs to resolve the record again, and can get a different answer than before.

A low TTL needs careful consideration, as the time it takes to obtain the DNS record’s content from the DNS servers, called DNS resolution time, can sometimes dominate the time it takes to retrieve a resource like a webpage. Connection performance and accuracy of the result are thus often at odds. For instance, if I want my changes to reach users in at most 15 seconds (hence a 15-second TTL) but DNS resolution takes 1 second, then every 15 seconds users take 1 more second to reach the service they are connecting to. Over a day, for a client that uses the service continuously, this adds up to 5760 seconds, or 1 hour and 36 minutes. If we slightly sacrifice accuracy by raising the TTL to 60 seconds, the added resolution time drops to 1440 seconds over a day, or only 24 minutes, improving overall performance.
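The arithmetic above can be checked with a few lines. This sketch assumes the worst case of a client that uses the service all day and therefore re-resolves once per TTL expiry; the function name is ours.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_resolution_overhead(ttl_seconds, resolution_seconds=1.0):
    """Seconds per day spent on DNS resolution, one resolution per TTL expiry."""
    resolutions_per_day = SECONDS_PER_DAY / ttl_seconds
    return resolutions_per_day * resolution_seconds


for ttl in (15, 60):
    overhead = daily_resolution_overhead(ttl)
    print(f"TTL {ttl}s: {overhead:.0f}s per day ({overhead / 60:.0f} minutes)")
# TTL 15s: 5760s per day (96 minutes)
# TTL 60s: 1440s per day (24 minutes)
```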

The use of caching and TTLs implies that DNS traffic management isn’t instant. There’s a delay before a changed record takes effect, which should be at most the TTL we configured. In practice, it can be slightly more, as some DNS resolvers, unbeknownst to the client, cache results for longer than the TTL as they see fit. This override of the TTL shouldn’t happen often, but it’s something to be aware of when choosing DNS for traffic management.

Examining Four DNS Traffic Management Use Cases

DNS traffic management is useful for systems that don’t necessarily have load balancing capabilities at the network level, whether through an IP-level load balancer or a front-facing proxy, that is, capabilities that only apply once we’re already connected to the service we’re trying to reach. There are many reasons to use DNS traffic management in front of services, and multiple reasons why we use it at Shopify.

Easy Failover

One of Shopify’s use cases is easily failing over a service when the live instance crashes or is rendered unavailable for any reason. With DNS traffic management set up to target two clusters but use one by default, traffic is simply redirected to the second cluster whenever the first one crashes, then redirected back when it recovers. This is commonly called active-passive. If you’re able to identify the unavailability of your main cluster in a timely fashion, this approach makes the failure almost seamless (considering the TTL) to the clients using the service, as they use the still-working cluster while the issue is solved, either automatically or through the intervention of the responsible on-call team (as a last resort). This relieves the pressure on those on-call teams, as they know that clients can still use the service while they solve the issue, sometimes even allowing the work to be pushed to the next working day.
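An active-passive decision can be sketched as “answer with the first healthy cluster in priority order”. The function name, the addresses, and the dictionary used as a health check are all assumptions for illustration; a real setup would rely on the health checks of its DNS provider or monitoring system.

```python
def active_passive_answer(clusters, is_healthy):
    """Return the first healthy cluster, in priority order."""
    for cluster in clusters:
        if is_healthy(cluster):
            return cluster
    # Nothing is healthy. Falling back to the primary is an assumption of
    # this sketch, not a rule: returning some answer beats returning none.
    return clusters[0]


clusters = ["192.0.2.10", "198.51.100.10"]  # primary first, then standby
health = {"192.0.2.10": False, "198.51.100.10": True}
print(active_passive_answer(clusters, health.get))  # 198.51.100.10
```

Once the primary’s health check passes again, the same function returns it, which redirects traffic back as cached answers expire.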

Share Traffic Between Endpoints

Services inevitably grow and end up receiving requests from many clients. Those requests then need to be shared between the available endpoints offering the exact same service. This is called active-active. Another motivation for this approach is financial: when using external vendors (external companies contracted to provide a service to your users) with minimum commitments, you can share your traffic load between those vendors in a way that ensures you reach those commitments. You define the percentage of traffic sent to each endpoint, which corresponds to the percentage of DNS requests answered with that endpoint.

Deploy a Change Progressively

When running production services, you sometimes need to make a potentially disruptive change (such as deploying a new feature, changing the behavior of an existing one, or updating a system to a new version). In such cases, deploying your change and crossing your fingers while hoping for success is, at best, risky. DNS traffic management can help by letting you move a small percentage of your traffic to a cluster that’s already updated, then move more and more chunks of traffic until all of it reaches the cluster with the new feature. This approach is commonly called blue-green deployment. You can then update the other cluster, which leaves you ready for the next update or failover.
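A progressive move of traffic can be described as a sequence of weight pairs applied one after the other. The step percentages below are arbitrary examples, and `rollout_weights` is a hypothetical helper, not part of any tool mentioned here.

```python
def rollout_weights(steps):
    """Yield (old_cluster_share, new_cluster_share) percentages for each step."""
    for new_share in steps:
        if not 0 <= new_share <= 100:
            raise ValueError(f"share must be a percentage, got {new_share}")
        yield 100 - new_share, new_share


schedule = list(rollout_weights([1, 10, 50, 100]))
print(schedule)  # [(99, 1), (90, 10), (50, 50), (0, 100)]
```

Between steps you would watch error rates and latency on the updated cluster, and set its share back to 0 if something looks wrong.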

Regionalize Traffic Decisions

You might find cases where some endpoints are more performant in some regions than others, which might happen when using external vendors. If performance is important for users, as it is for Shopify’s merchants and their customers, then you want to make sure the most performant endpoints are used for users in each region by allowing DNS answers based on the client’s location. Most of the time, geolocation can be fine-grained to the country, state or province level, or applied to a broader region of the world. Routing rules are defined to indicate what should be answered depending on the origin of requests. Once done, a client connecting from a location will get the answer that fits them. 
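Routing rules of this kind can be sketched as a lookup that prefers the most specific match and falls back to broader ones. The rule format, the location codes, and the addresses are assumptions made for this example. Real providers each have their own rule syntax and geolocation data.

```python
# Hypothetical rules: (scope, location code) -> endpoint address.
RULES = {
    ("subdivision", "CA-QC"): "198.51.100.10",  # province-level rule
    ("country", "CA"): "192.0.2.10",            # country-level rule
    ("region", "EU"): "203.0.113.10",           # broad region rule
}
DEFAULT = "192.0.2.99"  # answer when no rule matches


def geo_answer(rules, default, *, subdivision=None, country=None, region=None):
    """Return the answer of the most specific matching rule, else the default."""
    for key in (("subdivision", subdivision), ("country", country), ("region", region)):
        if key in rules:
            return rules[key]
    return default


print(geo_answer(RULES, DEFAULT, subdivision="CA-ON", country="CA", region="NA"))
# 192.0.2.10 (no Ontario rule, so the Canada rule applies)
```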

Our DNS traffic management journey took us from many manually set-up, maintained, and updated traffic management approaches to a fully automated self-served system used by more than 40 domains owned by more than 12 different teams, and handling more than 100M requests per 24h.