GraphQL powers Shopify's commerce data layer. We use it to serve deeply nested queries that scale geometrically (like fetching 250 products with 250 variants each), creating fan-out that GraphQL APIs frequently guard against. We support these patterns to make technology work for our merchants, never the other way around.
However, such high-cardinality patterns present real scaling challenges, and when we dug into traces we found an unexpected bottleneck: the majority of request time often wasn't spent performing I/O at all; it was spent running the field resolvers that built the GraphQL response.
The main culprit was GraphQL's conventional depth-first execution model and its hidden scaling costs. So we built something new: GraphQL Cardinal, a breadth-first execution engine that resolves each field once across all objects instead of once per object.
The result? Large list queries can see 15x faster execution with 90% less memory, which can shave whole seconds off P50 times, and we're still discovering Cardinal's full potential.
This post is an open letter to the GraphQL community. We'll walk through the hidden costs that we’ve observed in depth-first traversal, the breadth-first hypothesis that led to Cardinal, how the engine works internally, and what it takes to migrate a massive production stack to an entirely new execution model.
Problems of scale
Shopify's GraphQL-powered data layer supports deeply-nested structures with fan-out that GraphQL APIs frequently guard against. For example:
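A query in that shape, sketched against the product/variant pattern mentioned above (field names are illustrative of the Admin API, not exact):

```graphql
query {
  products(first: 250) {
    nodes {
      title
      variants(first: 250) {
        nodes {
          title
          price
        }
      }
    }
  }
}
```

Each of the 250 products fans out into up to 250 variants, so this single request can resolve tens of thousands of objects.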
Nested lists like this tend to perform poorly as they scale geometrically, and examining traces for such queries often reveals a surprising bottleneck in our stack: the majority of request time may be spent running field resolvers that assemble the response, not loading the data itself.

The more we studied this issue, the more it became clear that our central problem was GraphQL execution’s bias towards depth-based recursion and its hidden costs. This led us to reconsider the entire design of GraphQL execution, and develop an entirely new breadth-based model that is better optimized for our business.
Before we embark on this journey though, let’s take a moment to define what GraphQL “depth” and “breadth” mean in this context.

- Depth describes the static size of a GraphQL request document. It considers the number of fields selected, and how they’re nested. This dimension is fixed, and we expect “very large” document field sizes to be in the low hundreds.
- Breadth describes the dynamic width of the resolved data, which scales by the number of objects returned across list fields. This dimension is highly variable, and “very large” sizes may be in the tens or even hundreds of thousands of objects.
The hidden costs of depth traversal
Conventional GraphQL engines perform execution using depth-first traversal:

What that means is the engine descends recursively through each object's subtree before moving on to the next. In the flow above, we resolve a product in a list, then all of its child variants, then advance to the next product in the list and repeat.
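A minimal sketch of this recursive pattern (illustrative only, not any real engine's code; the selection format is our invention) might look like:

```ruby
# Minimal depth-first execution sketch. Each object's subtree is fully
# resolved before the engine advances to the next object in a list.
def execute_depth_first(object, selections)
  selections.each_with_object({}) do |(field_name, child_selections), result|
    value = object.public_send(field_name) # run this field's resolver

    result[field_name] =
      if child_selections.nil?
        value # leaf field: emit the raw value
      elsif value.is_a?(Array)
        # Descend through each list item one at a time; this recursion
        # produces the "column" pattern discussed in the next section.
        value.map { |item| execute_depth_first(item, child_selections) }
      else
        execute_depth_first(value, child_selections)
      end
  end
end
```

Calling this with a product and a nested selection like `{ title: nil, variants: { title: nil } }` resolves one product's entire subtree before touching the next product.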
Almost every GraphQL implementation uses this depth-first pattern, including the canonical graphql-ruby gem that we have used since 2015, and the official graphql-js spec implementation that it follows. In our experience running this execution model with Ruby, we’ve found that it scales poorly.
Cost: linear scale
The primary hidden cost of depth-based traversal is that it lacks the opportunity to amortize CPU-bound processing across subtrees, as seen in this stack profile:

This profile shows GraphQL processing a list of one hundred products, each with one hundred child variants. We can see a distinct “column” pattern emerge during field execution, where each column is the slice of time spent traversing a single product’s subtree. These columns are independent—subtree processing is not amortized, so the time to process 100 similarly-sized products is simply the time of one multiplied by 100.
This is linear time complexity that scales directly with the size of the response, and it is baked into GraphQL’s conventional execution design.
Cost: field-level overhead
Linear scale then amplifies another problem: each GraphQL field execution carries some non-zero overhead cost for engine bookkeeping, authorization, instrumentation, and so on. Some of these costs are of our own making, while others are inherent to the GraphQL engine.

For example, an empty field-level tracing hook running on 1K fields made our stack about 10% slower; just adding the field wrapper created overhead. These tiny per-field costs are elusive and tend to slip between profiling frames, which makes them difficult to measure in aggregate, even with left-heavy profiling.
We incur these costs for every field of every object in depth-based execution, and this multiplicative overhead can balloon into entire seconds of CPU-bound execution time—we’ll show you that below.
Cost: lazy dataloader promises
One particular field-level overhead deserves a special mention: dataloader promises. Dataloaders are an essential tool for solving GraphQL's N+1 problems. Rather than performing separate I/O for each of N fields, we instead resolve a promise for each field while pooling their lookup criteria, then lazily load all criteria at once, and then fulfill each promised value.

While dataloaders are good for optimizing I/O-bound performance, they come with steep memory and CPU performance tradeoffs because they incur a bloat of promise allocations, create Garbage Collector (GC) backpressure, and add execution backtracking. Resolving 1K lazy fields through a graphql-batch workflow with no I/O ran ~2.5x slower than the equivalent non-lazy fields in our stack.
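To make that overhead concrete, here is a minimal sketch of the promise-pooling pattern (loosely modeled on the graphql-batch workflow; the class and method names are ours): every field call allocates a promise object, and every promise is later fulfilled individually.

```ruby
# Minimal batch-loader sketch (names are illustrative). Every lazy field
# call allocates a promise, so N lazy fields mean N promise objects for
# the garbage collector to churn through.
class TinyLoader
  Promise = Struct.new(:value)

  def initialize(&batch_fn)
    @batch_fn = batch_fn
    @pending = {} # lookup key => promise
  end

  # Called once per field: pools the key and returns an unfulfilled promise.
  def load(key)
    @pending[key] ||= Promise.new(nil)
  end

  # Called once after all fields enqueue: performs one batched lookup
  # (e.g. a single SQL query), then fulfills every promise individually.
  def run!
    results = @batch_fn.call(@pending.keys)
    @pending.each { |key, promise| promise.value = results[key] }
  end
end
```

Resolving 1K lazy fields this way means 1K `load` calls, 1K promise allocations, and 1K fulfillment steps before execution can backtrack and continue.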
The breadth-first hypothesis
These problems with depth-based execution led us to consider an alternative strategy: what if all field executions ran breadth-first instead? What if we performed a single pass down the request document and only executed field resolvers one time each with an aggregated breadth of objects?

To make this work, we’d change field resolvers to each receive a set of objects, and return a mapped set of results.
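A sketch of what that interface could look like in Ruby (the `resolve_all` name and surrounding classes are hypothetical, not Cardinal's actual API):

```ruby
# Hypothetical breadth-first resolver shape: each resolver is called once
# per field with ALL objects in the scope, and returns one result per
# object, in the same order.
module InventoryStore
  # Stand-in for real batched I/O, e.g. one SQL query for all ids.
  def self.fetch_levels(ids)
    ids.to_h { |id| [id, id * 5] }
  end
end

class TitleField
  def resolve_all(products)
    products.map(&:title)
  end
end

class InventoryField
  def resolve_all(products)
    # Shared I/O binds the whole object set to a single lookup;
    # no per-object promises required.
    levels = InventoryStore.fetch_levels(products.map(&:id))
    products.map { |product| levels[product.id] }
  end
end
```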
This interface is similar to Airbnb’s batched resolvers, but our underlying implementation would make breadth batching a native function of the engine rather than wrapping depth traversal in dataloaders. We’re also operating at a subgraph execution level, which makes this considerably different from Wundergraph’s breadth batching of federated supergraph partials.
Theoretically, resolvers in this breadth-based system should run longer and hotter on business logic, with no platform overhead for field repetitions. Individual fields would be implicitly batched, and multiple fields sharing I/O could run dataloaders that bind entire object sets to a single promise rather than building one-to-one promises.

Even simple napkin math made this breadth-based approach look promising.
The napkin math
Assumption: all GraphQL fields have some non-zero overhead cost associated with their execution. For simplicity, let's round up and say this cost is 1ms (which is quite pessimistic).
Scenario: we resolve five fields (depth) across a list of 1,000 objects (breadth):
- depth-first: we call 5,000 field resolvers (depth × breadth) and incur 5s of cost (5 × 1000 × 1ms)
- breadth-first: we call 5 field resolvers (depth-only) and incur only 5ms (5 × 1ms)
Now assume each field operates lazily and returns a promise:
- depth-first: we build and resolve 5,000 intermediary promises (depth × breadth)
- breadth-first: we build and resolve 5 intermediary promises (depth-only)
Now assume we chain a .then onto the lazy promise resolution:
- depth-first: we run 10,000 promise callbacks (depth × breadth × 2)
- breadth-first: we run 10 promise callbacks (depth × 2)
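These figures are easy to sanity-check in plain arithmetic:

```ruby
# The napkin math from above, spelled out.
depth   = 5      # fields selected
breadth = 1_000  # objects in the list
cost_ms = 1      # assumed per-field overhead

depth_first_calls   = depth * breadth  # 5_000 resolver calls
breadth_first_calls = depth            # 5 resolver calls

depth_first_cost_ms   = depth_first_calls * cost_ms    # 5_000ms, i.e. 5s
breadth_first_cost_ms = breadth_first_calls * cost_ms  # 5ms

depth_first_callbacks   = depth * breadth * 2  # 10_000 promise callbacks
breadth_first_callbacks = depth * 2            # 10 promise callbacks
```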
By these basic figures, breadth-first execution should scale lists more favorably by removing our largest dimension as a multiplying factor from platform overhead costs.
From whitepaper to engine
This hypothesis led us to prototype a new GraphQL engine optimized for high-cardinality set execution: GraphQL Cardinal.
The engine was built as a standalone execution wrapper around the static GraphQL Ruby primitives that we were already using (schemas, ASTs, etc.). The original proof-of-concept of Cardinal’s core algorithm can be found in graphql-breadth_exec.

For our initial experiments, we fed flat JSON data with 5K fields into Cardinal and GraphQL Ruby, and had each engine process the same structure back out. Cardinal’s CPU-bound execution speed was ~15x faster and used 90% less memory, which was very encouraging!
However, these benchmarks require scrutiny because not all requests will benefit equally from a breadth-first strategy. It’s important to understand how breadth advantages scale by repetition:
The study above compares a 7-deep object subtree with varying degrees of list repetition. In the first case with only one list item, there is no breadth repetition and depth-based execution wins out by a slim margin (negligible when run only once). However, scaling up to a list of two already demonstrates breadth's advantage, and this advantage grows dramatically as repetitions increase.
We see a similar story when comparing memory usage:
These findings are even more pronounced when studying a single field using dataloader promises:
Testing our experimental engine in production, we fetched various sized payloads of products and their child variants and found that these breadth-based scale advantages clearly translated into end-to-end response time improvements: we saved over 4s of time at P50 for our largest test queries.

Inspecting profiles of these tests confirmed our theory about linearly scaling field costs: Cardinal requests spent equal time on I/O and data staging, but improved on GraphQL field execution and its neighboring garbage collection by huge margins to deliver these end-to-end time improvements.

How Cardinal breadth execution works
Now let's dig into the internals of how Cardinal performs breadth-first GraphQL execution. We'll step through execution of the following query, which uses a simplified version of the Shopify Admin API:
Tree building
The first step for any request is to construct an execution tree. This tree has two main primitives: scopes and fields. A scope defines a typed closure with many fields, while a field has a return type and zero-to-many child scopes. Written as pseudocode, an execution tree looks like this:
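As a hedged sketch (the names and data shapes are our illustration, not Cardinal's actual structures), the two primitives might be modeled as:

```ruby
# Illustrative model of the execution-tree primitives described above:
# a scope is a typed closure over many fields; a field has a return type
# and zero-to-many child scopes.
Scope = Struct.new(:type, :fields, keyword_init: true)
Field = Struct.new(:name, :return_type, :child_scopes, keyword_init: true)

# A tree for `{ products { title } }` against a concrete Product type:
tree = Scope.new(
  type: "QueryRoot",
  fields: [
    Field.new(
      name: "products",
      return_type: "[Product!]!",
      child_scopes: [
        Scope.new(
          type: "Product",
          fields: [
            Field.new(name: "title", return_type: "String!", child_scopes: []),
          ]
        ),
      ]
    ),
  ]
)
```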
Execution trees are built eagerly based on a request’s statically-resolvable AST. Abstract positions (not statically resolvable) are omitted from the tree and get built lazily once the parent field resolves its objects. The benefit of this pattern is that execution scopes are always concretely-typed and require no guesswork. An intentional constraint of this design is that an execution tree can only be navigated upward, never down.
Planning phase (lookbehind)
After tree building, Cardinal runs a bottom-up planning pass—heavily inspired by Grafast. During this pass, each field may consider its ancestors and register preloads and/or planning notes that may influence parent execution strategies. We offer this lookbehind pass as an alternative to lookahead, because lookahead cannot make informed choices about unresolved abstracts below it.
Execution
Now for the main event. We’ve built the execution tree from top-down; we’ve planned the tree from bottom-up; now it’s time to go top-down again running execution. Following the GraphQL spec, we start with a root object to resolve from, and an empty hash as its result data:

Note that each scope in the tree holds a set of objects and their mapped results that start empty. These sets will get filled in as we go.
Our first execution step runs field resolvers in the root scope. Resolvers are called only once per field with the scope’s complete set of objects, and they must return a mapped set of results:

In the above, a field resolver mapped the one shop object into one list of its products, which matches the schema. Next we key the resolved data structure into the scope’s results to establish list groupings and create new empty result hashes for each object:

Lastly, we flat-map out all resolved objects and their corresponding result hashes into the next scope as its objects and results. Algorithmically, this step can combine with building results so that we only traverse the resolved field data once:
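A sketch of that combined step (the data shapes and method name are our illustration, not Cardinal's code): each scope holds parallel arrays, where `objects[i]` maps to `results[i]`.

```ruby
# Illustrative sketch of one breadth "generation" for a list field:
# key the resolved lists into the current scope's results, and flat-map
# the child objects plus their empty result hashes into the next scope.
def advance_list_field(scope, field_name, resolved_lists)
  next_objects = []
  next_results = []

  scope[:objects].each_index do |i|
    resolved_lists[i].each do |child_object|
      child_result = {} # empty result hash, filled in by later generations
      (scope[:results][i][field_name] ||= []) << child_result # keyed in place
      next_objects << child_object
      next_results << child_result
    end
  end

  { objects: next_objects, results: next_results }
end
```

Because each child result hash is keyed into the parent result and passed to the next scope by reference, filling it in during the next generation shapes the final response automatically.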

One generation down! Now we repeat. Breadth really starts to shine when handling merged sets, as we see here:

Note that the generation ended with the next scope holding a flat mapping of all objects and results assembled before it. Flat sets are fast to process while amortizing setup work across subtrees. Finally, we’ll run this sequence one more time to finish off the leaf field selection:

That’s it! Or is it? You may be wondering when the response tree gets built. It’s easy to miss, but we already assembled it—looking at the root result object that we started execution with, it now looks like this:
Result hashes were keyed in-place and passed down by reference across scopes, to be shaped during the next generation. This pattern of passing flat sets is breadth’s superpower for sharing CPU-bound work cycles across list elements.
Errors
Unlike depth execution, breadth has no concept of subtrees by which to track error paths or bubble exceptions. As a result, breadth execution generally runs to completion (aside from failed mutation fields, which always terminate early). All rescued errors are inlined into the response tree, and then a depth traversal step is added at the end to locate and report on where errors occurred.
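A sketch of that final error-locating pass (the names are ours): walk the finished response once, collecting the path to every inlined error.

```ruby
# Illustrative post-execution error pass: rescued errors were inlined into
# the response during breadth execution, and one depth traversal at the
# end reports where they occurred.
class InlinedError < StandardError; end

def collect_error_paths(node, path = [], found = [])
  case node
  when InlinedError
    found << { path: path, message: node.message }
  when Hash
    node.each { |key, value| collect_error_paths(value, path + [key], found) }
  when Array
    node.each_with_index { |value, i| collect_error_paths(value, path + [i], found) }
  end
  found
end
```

For a response like `{ "products" => [{ "title" => "Shirt" }, { "title" => InlinedError.new("boom") }] }`, this yields one error at path `["products", 1, "title"]`.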
While this strategy is less surgical than depth-based execution, it may still net faster responses thanks to breadth's other performance advantages. Either way, it’s a reasonable tradeoff given that <1% of Shopify’s API traffic results in non-validation errors, so we chose to optimize for our majority success rate.
Engine
Another novel aspect of Cardinal’s breadth-first design is that the processing engine is driven by enqueuing rather than recursion. This avoids many of the deep stack traces that GraphQL is notorious for, and contributes to reducing Cardinal’s memory footprint. While Cardinal’s main execution loop has grown slightly over time, it started out as a single line of code:
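We can only sketch the idea here (this is our illustration, not Cardinal's actual code): with a flat work queue, the driving loop reduces to a single line.

```ruby
# Illustrative enqueue-driven engine loop. Scopes are pulled from a flat
# queue instead of being recursed into, so the call stack stays shallow no
# matter how deeply the query nests. The loop itself is one line:
def run_engine(queue, &execute_scope)
  execute_scope.call(queue.shift, queue) until queue.empty?
end

# Each scope execution may enqueue child scopes, replacing recursion:
visited = []
run_engine([[:root, [[:products, [[:variants, []]]]]]]) do |(name, children), queue|
  visited << name
  children.each { |child| queue << child }
end
visited # => [:root, :products, :variants]
```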
Migrating to breadth execution
The prospect of actually adopting this breadth paradigm was far more challenging than our blue-sky prototyping work to develop it. Our entire core monolith was built around the traditional “receive and return one” field resolver interface, while breadth execution would require switching to “receive and return many.” We’d need an incremental strategy to bridge this gap.
A GraphQL Ruby interpreter
We started building an interpreter that would allow the Cardinal engine to puppet GraphQL Ruby’s runtime sequence for legacy fields. While this interpreter shouldn’t be any faster (it’d still need to run legacy field resolvers individually), it would allow us to run our existing stack while incrementally swapping out legacy resolvers for their faster breadth replacements.
We successfully tuned this interpreter to pass our entire core test suite. It was even slightly faster at list-heavy queries by cutting out some GraphQL Ruby redundancies, but it consumed more memory.
This was a moment where collaboration with Claude AI really shone: presented with our memory tradeoff, Claude was able to improve the interpreter's memory efficiency by 40%. By the time we rolled out, the interpreter was slightly lighter and faster at list repetitions, and produced visible benefits for some list-heavy queries without changing any field resolvers.

Migrating tracers
Our field-level tracers that instrument field performance and schema metrics also scaled linearly under depth execution. An exciting outcome of migrating to breadth is that these tracers now run only once per field selection, making them dramatically cheaper.

This change required some minor strategy adjustments—for example, field timings would need to capture a single duration for a breadth resolver, and then average that duration across the number of resolved objects. This was effectively how we were reporting the data anyway, so it was a relatively straightforward adaptation with much lower capture costs.
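For example (a hedged sketch; the helper and metric names are ours), a breadth tracer can capture one duration and average it across the set:

```ruby
# Illustrative breadth-first field timing: one clock capture per field
# selection, with the duration averaged across the objects it covered.
# This reports the same per-object average as before, at far lower cost.
def trace_breadth_field(field_name, objects, metrics)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  results = yield(objects) # the breadth resolver runs once for all objects
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started

  metrics[field_name] = { count: objects.size, avg_seconds: elapsed / objects.size }
  results
end
```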
Migrating field resolvers
Now that the Cardinal engine runs our core stack, we've started a new leg of this journey focused on migrating legacy field resolvers to run breadth-first. This introduces a whole new set of challenges in safely managing the translation and rollout of tens of thousands of new field implementations. Our team has risen to this challenge with numerous innovations:
- A library of Claude AI skills to accelerate breadth translations
- A shadow verifier to confirm migrated breadth fields match their legacy counterparts
- A benchmark suite for studying the performance of migrated queries
- Numerous burndown and migration tracking metrics

This migration work is ongoing, and in many cases the translation of field resolvers is quite simple. The trickier cases are fields that share a query, or use nuanced early-return strategies that must be carefully matched. To date, all regressions can be attributed to mistakes in translation—we have yet to find a non-error scenario where breadth-based execution is fundamentally worse off.
Looking ahead
We see plenty more opportunities to continue building upon the Cardinal engine’s strengths. To date, everything we’ve achieved uses synchronous Ruby-native language features.
We see async patterns and lower-level C language bindings as major untapped opportunities for continuing development.
Try it
Shopify merchants require their business data to be readily accessible, and it’s our job to make technology scale to match that need. While our breadth-first approach was driven by our own requirements, our fundamental need to accelerate large lists is hardly unique.
We write this post as an open letter to the GraphQL community to present our findings and to start a conversation. The official spec says, "[GraphQL] conformance requirements expressed as algorithms can be fulfilled … in any way as long as the perceived result is equivalent." We think it’s time to shake up the status quo.
Rubyists can try out Cardinal's breadth-first concepts in GraphQL Ruby's new execution module that we're collaborating on. As for the graphql-js community, which defines GraphQL's de facto standard implementation, we offer two final benchmarks that highlight the breadth-first potential relative to the language resources running it:
While it’s hard to make direct comparisons across languages and JIT strategies, the potential here looks worthy of investigation.
Greg MacWilliam is a Ruby engineer, GraphQL enthusiast, and the open-source author of graphql-stitching-ruby. Coder, dad, skier; likes dogs, juggles fire. Greg’s home office co-worker is an angry cardinal who attacks his own reflection in the window.
