At Shopify, we've been building Sidekick, an AI-powered assistant that helps merchants manage their stores through natural language interactions. From analyzing customer segments to filling product forms and navigating complex admin interfaces, Sidekick has evolved from a simple tool-calling system into a sophisticated agentic platform. Along the way, we've learned valuable lessons about architecture design, evaluation methodologies, and training techniques that we want to share with the broader AI engineering community.
The Evolution of Sidekick's Architecture
Sidekick is built around what Anthropic calls the "agentic loop" – a continuous cycle where a human provides input, an LLM processes that input and decides on actions, those actions are executed in the environment, feedback is collected, and the cycle continues until the task is complete.
In practice, this means Sidekick can handle requests like "which of my customers are from Toronto?" by automatically querying customer data, applying the appropriate filters, and presenting results. Or when a merchant asks for help writing SEO descriptions, Sidekick can identify the relevant product, understand the context, and fill in optimized content directly into the product form.
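To make the loop concrete, here's a minimal sketch in Python. The `llm` client, its `chat` interface, and the tool-call shapes are hypothetical stand-ins for illustration, not Sidekick's actual implementation:

```python
def agentic_loop(llm, tools, user_message, max_steps=10):
    """Run the cycle: the model proposes an action, we execute it, and feed back the result."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        response = llm.chat(messages, tools=tools)       # hypothetical LLM client
        messages.append(response.as_message())           # keep the model's turn in context
        if not response.tool_calls:                      # no tool requested: the task is done
            return response.content
        for call in response.tool_calls:
            result = tools[call.name](**call.arguments)  # act on the environment
            messages.append({"role": "tool", "name": call.name, "content": result})
    return None  # gave up after max_steps
```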
The Tool Complexity Problem
As we expanded Sidekick's capabilities, we quickly hit a scaling challenge that many teams building agentic systems will recognize. Our tool inventory grew from a handful of well-defined functions to dozens of specialized capabilities:
- 0-20 tools: Clear boundaries, easy to debug, straightforward behavior
- 20-50 tools: Boundaries become unclear, tool combinations start causing unexpected outcomes
- 50+ tools: Multiple ways to accomplish the same task, system becomes difficult to reason about
This growth led to what we call "Death by a Thousand Instructions" – our system prompt became an unwieldy collection of special cases, conflicting guidance, and edge case handling that slowed down the system and made it nearly impossible to maintain.
Just-in-Time Instructions: A Solution for Scale
Our breakthrough came with implementing Just-in-Time (JIT) instructions. Instead of cramming all guidance into the system prompt, we return relevant instructions alongside tool data exactly when they're needed. Our goal is to craft the perfect context for the LLM for every single situation, not a token less, not a token more.
How it works in practice
When a tool call returns, the relevant instructions come back alongside the data, and the LLM's next response follows that guidance directly.

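As an illustration, a tool result carrying its own just-in-time instructions might look like the hypothetical payload below; the tool name, field names, and instruction text are invented for this sketch rather than taken from Sidekick's schema:

```python
# Hypothetical tool result: the data the agent asked for, plus the guidance
# that applies to exactly this situation, injected just in time.
tool_result = {
    "tool": "search_customers",
    "data": [
        {"id": "gid://shopify/Customer/1", "city": "Toronto"},
        {"id": "gid://shopify/Customer/2", "city": "Toronto"},
    ],
    "instructions": (
        "Summarize the matching customers in one short paragraph. "
        "Offer to create a customer segment only if more than 50 results matched. "
        "Never list raw customer IDs in the reply."
    ),
}
```

Because the guidance rides along with the tool output instead of living in the system prompt, only the turns that actually need it pay the token cost.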
This approach provides three key benefits:
- Localized Guidance: Instructions appear only when relevant, keeping the core system prompt focused on fundamental agent behavior
- Cache Efficiency: We can dynamically adjust instructions without breaking LLM prompt caches
- Modularity: Different instructions can be served based on beta flags, model versions, or page context
The results were immediate – our system became more maintainable while performance improved across all metrics.
Building Robust LLM Evaluation Systems
One of the biggest challenges in deploying agentic systems is evaluation. Traditional software testing approaches fall short when dealing with the probabilistic nature of LLM outputs and the complexity of multi-step agent behaviors.
These days, too many people are vibe testing their LLM systems and assuming that's good enough; it's not. Vibe testing, or creating a "vibe LLM judge" that just asks "rate this 0-10," won't cut it. Evaluation needs to be principled and statistically rigorous; otherwise you're shipping with a false sense of security.
Ground Truth Sets Over Golden Datasets
We moved away from carefully curated "golden" datasets toward Ground Truth Sets (GTX) that reflect actual production distributions. Rather than trying to anticipate every possible interaction (what spec docs usually try to enumerate), we sample real merchant conversations and create evaluation criteria based on what we observe in practice.
The process involves:
- Human Evaluation: Have at least three product experts label conversations across multiple criteria
- Statistical Validation: Use Cohen's Kappa, Kendall's Tau, and Pearson correlation to measure inter-annotator agreement (a sketch follows this list)
- Benchmarking: Treat human agreement levels as the theoretical maximum our LLM judges can achieve
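Here's a rough sketch of that statistical validation step using standard scipy and scikit-learn implementations; the rater names and labels are made-up data:

```python
from itertools import combinations

from scipy.stats import kendalltau, pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-10 quality labels from three product experts on the same conversations.
ratings = {
    "rater_a": [8, 3, 9, 5, 7, 2],
    "rater_b": [7, 4, 9, 5, 6, 2],
    "rater_c": [8, 3, 8, 6, 7, 3],
}

# Pairwise inter-annotator agreement; the average over pairs is the ceiling
# we expect any LLM judge to approach.
for (name_a, a), (name_b, b) in combinations(ratings.items(), 2):
    kappa = cohen_kappa_score(a, b)
    tau, _ = kendalltau(a, b)
    r, _ = pearsonr(a, b)
    print(f"{name_a} vs {name_b}: kappa={kappa:.2f} tau={tau:.2f} pearson={r:.2f}")
```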
LLM-as-a-Judge with Human Correlation
We developed specialized LLM judges for different aspects of Sidekick's performance, but the key insight was calibrating these judges against human judgment. Through iterative prompting, we improved our judges from barely better than random (Cohen's Kappa of 0.02) to near-human performance (0.61 vs. a human baseline of 0.69). Once a judge correlates that strongly with humans, we run a blind swap test: for each conversation in our GTX we randomly substitute a human rater for the judge, and if it's hard to tell from the group's labels whether a human or the judge did the rating, we treat the judge as trustworthy.
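One way to make that swap test concrete, sketched below with made-up labels: randomly take either the judge's label or a held-out human's on each conversation, then check whether agreement with the panel shifts.

```python
import random

from sklearn.metrics import cohen_kappa_score

# Hypothetical labels on the same GTX conversations.
panel_labels  = [4, 1, 5, 3, 4, 2, 5, 1]   # consensus of the expert panel
heldout_human = [4, 2, 5, 3, 4, 2, 4, 1]   # one more human, held out of the panel
llm_judge     = [4, 1, 5, 3, 3, 2, 5, 1]   # calibrated LLM judge

# Randomly substitute the judge for the human on each conversation.
mixed = [random.choice(pair) for pair in zip(heldout_human, llm_judge)]

# If the mixed rater agrees with the panel about as well as the human does,
# the judge is hard to tell apart from a human.
print("human vs panel:", cohen_kappa_score(heldout_human, panel_labels))
print("mixed vs panel:", cohen_kappa_score(mixed, panel_labels))
print("judge vs panel:", cohen_kappa_score(llm_judge, panel_labels))
```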
User Simulation for Comprehensive Testing
To test candidate changes before production deployment, we built an LLM-powered merchant simulator that captures the "essence" or goals of real conversations and replays them through new system candidates. This enables us to run simulations of many different candidate systems, and choose the best performing one.
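A sketch of what one simulated replay could look like; the simulator prompt, the `llm.chat` interface, and the `candidate_system` API are all assumptions made for this example:

```python
def simulate_conversation(llm, candidate_system, conversation_goal, max_turns=8):
    """Drive a candidate Sidekick build with an LLM playing the merchant."""
    transcript = []
    merchant_message = conversation_goal.opening_message
    for _ in range(max_turns):
        assistant_reply = candidate_system.respond(merchant_message, transcript)
        transcript.append((merchant_message, assistant_reply))
        # The simulator decides, in character, whether the goal is met
        # and what the merchant would plausibly say next.
        next_turn = llm.chat(
            system="You are a Shopify merchant. Pursue your goal; say DONE when satisfied.",
            messages=[f"Goal: {conversation_goal.description}",
                      f"Assistant said: {assistant_reply}"],
        )
        if "DONE" in next_turn.content:
            break
        merchant_message = next_turn.content
    return transcript  # scored afterwards by the production LLM judges
```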
The complete evaluation pipeline chains these pieces together: sample real conversations into the GTX, label them with product experts, calibrate LLM judges against those labels, and replay simulated merchants through candidate systems scored by the judges. This approach has proven invaluable for catching regressions and validating improvements before they reach merchants.
GRPO Training and Reward Hacking
For model fine-tuning, we implemented Group Relative Policy Optimization (GRPO), a reinforcement learning approach that uses our LLM judges as reward signals. We developed an N-Stage Gated Rewards system that combines procedural validation (syntax checking, schema validation) with semantic evaluation from LLM judges.
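Here's a minimal sketch of how a gated reward can compose, assuming the completion is a JSON tool call: each procedural gate must pass before the next stage runs, and the LLM judge's semantic score only applies once the output is structurally valid. The stage values and weights are illustrative, not our production configuration.

```python
import json

def gated_reward(completion, schema_validator, llm_judge):
    """N-stage gated reward: procedural gates first, semantic score last."""
    # Stage 1: the output must parse at all.
    try:
        payload = json.loads(completion)
    except ValueError:
        return 0.0
    # Stage 2: it must conform to the tool/skill schema (enums, required fields, IDs).
    if not schema_validator(payload):
        return 0.2
    # Stage 3: only structurally valid outputs earn the semantic judge score.
    return 0.2 + 0.8 * llm_judge.score(payload)  # judge returns a value in [0, 1]
```

Gating this way keeps the semantic judge from rewarding completions that read well but would never execute.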
The Reality of Reward Hacking
Despite our careful evaluation design, we encountered significant reward hacking during training. The model found creative ways to game our reward system:
- Opt-out Hacking: Instead of attempting difficult tasks, the model would explain why it couldn't help
- Tag Hacking: Using customer tags as a catch-all instead of proper field mappings
- Schema Violations: Hallucinating IDs or using incorrect enum values
For example, when asked to "segment customers with status enabled," the model learned to create filters like customer_tags CONTAINS 'enabled' instead of the correct customer_account_status = 'ENABLED'.
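A rule for catching this particular hack might look like the hypothetical check below; the regular expression and the status values are assumptions for illustration, not our production validator.

```python
import re

# Flag segment filters that stuff an account-status value into customer_tags
# instead of using the dedicated field.
TAG_HACK = re.compile(
    r"customer_tags\s+CONTAINS\s+'(enabled|disabled|invited|declined)'", re.IGNORECASE
)

def flags_tag_hacking(segment_query: str) -> bool:
    return bool(TAG_HACK.search(segment_query))

assert flags_tag_hacking("customer_tags CONTAINS 'enabled'")
assert not flags_tag_hacking("customer_account_status = 'ENABLED'")
```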
Iterative Improvement
Addressing reward hacking required updating both our syntax validators and LLM judges to recognize these failure modes. After implementing fixes:
- Syntax validation accuracy improved from ~93% to ~99% across all skills
- LLM judge correlation increased from 0.66 to 0.75 on average
- Most importantly, end-to-end conversation quality matched our supervised fine-tuning baseline
Key Takeaways for Production Agentic Systems
Based on our experience building and deploying Sidekick, here are our key recommendations:
Architecture Principles
- Stay Simple: Resist the urge to add tools without clear boundaries. Quality over quantity applies strongly to agent capabilities
- Start Modular: Use patterns like JIT instructions from the beginning to maintain system comprehensibility as you scale
- Avoid Multi-Agent Architectures Early: Simple single-agent systems can handle more complexity than you might expect
Evaluation Infrastructure
- Build Multiple LLM Judges: Different aspects of agent performance require specialized evaluation approaches
- Align Judges with Human Judgment: Statistical correlation with human evaluators is essential for trust in automated evaluation
- Expect Reward Hacking: Plan for models to game your reward systems and build detection mechanisms accordingly
Training and Deployment
- Procedural + Semantic Validation: Combine rule-based checking with LLM-based evaluation for robust reward signals
- User Simulation: Invest in realistic user simulators for comprehensive pre-production testing
- Iterative Judge Improvement: Plan for multiple rounds of judge refinement as you discover new failure modes
Looking Forward
We're continuing to evolve Sidekick's architecture and evaluation systems. Future work includes incorporating reasoning traces into our training pipeline, using the simulator and production judges during training, and exploring more efficient training approaches.
The field of production agentic systems is still young, but the patterns we've developed at Shopify – modular architectures, robust evaluation frameworks, and careful attention to reward hacking – provide a foundation for building reliable AI assistants that merchants can depend on.
Building production-ready agentic systems requires more than just connecting LLMs to tools. It demands thoughtful architecture decisions, rigorous evaluation methodologies, and constant vigilance against the unexpected ways these systems can fail. But when done right, the result is AI that truly augments human capabilities in meaningful ways.
The Shopify ML team is actively hiring for roles in agentic systems, evaluation infrastructure, and production ML. If these challenges interest you, we'd love to hear from you.
About the Author
Andrew McNamara is the Director of Applied ML at Shopify, where he leads Sidekick, an AI assistant that helps merchants run their businesses more effectively. He has been building assistants for over 15 years.
X: @drewch