How Shopify Built a Model-Agnostic AI Stack

By Lazar Milicevic · Published June 25, 2026 · 9 min read

You're three months into building an AI feature. It works. Customers love it. Then your model provider emails: "We're deprecating this version in 60 days." Or worse — the API just starts returning 500s on a Tuesday morning and your support inbox explodes.

This is the reality of building on LLMs in 2026. Shopify hit this wall early, and instead of betting on one provider, they built an internal proxy that treats models as interchangeable parts. Here's how it works, why it matters, and how to build the same pattern into your stack — even if you're a team of one.

The problem: every LLM is a moving target

Models get deprecated. Providers change pricing without warning. Rate limits shift. A vendor you trusted last quarter quietly degrades quality on the model you fine-tuned your prompts around. If your production code has client = anthropic.Anthropic() hard-coded at the top of every file, you have a fragility problem disguised as a dependency.

Shopify's engineering team has talked openly about this. In public conference talks and engineering blog posts, their AI infrastructure team has described a pattern where every internal AI call routes through a single proxy layer. When Anthropic ships a new Claude version, when OpenAI deprecates an older endpoint, when a regional outage hits one provider, the proxy reroutes traffic. Application code never changes.

This is not exotic infrastructure. It's the same pattern that made cloud-native apps survive AWS region outages: abstract the dependency, route around failure. The novelty is applying it to LLMs, where the "service" you're calling has unstable APIs, unstable pricing, and a habit of disappearing.

What a model-agnostic proxy actually does

An LLM proxy is a thin service that sits between your application code and any model provider (Anthropic, OpenAI, Google, Mistral, a local Ollama instance, whatever). Your app calls one endpoint with a generic request format. The proxy decides:

Which provider gets the request right now
What to do if that provider returns an error or times out
Whether to retry on the same model or fall back to a different one
How to normalize the response so your app gets a consistent shape

The minimum viable version is maybe 200 lines of Python. The mature version, the kind that Shopify and companies like Notion (an Anthropic customer) run, adds caching, observability, cost tracking per team, prompt versioning, and policy enforcement.

Here's the core abstraction:

# What your app code looks like with a proxy
response = llm_proxy.complete(
    task="customer_email_classification",
    prompt=user_input,
    max_tokens=500,
)

Notice what's missing: no model name, no provider SDK, no API key handling. The proxy maps each task to a routing policy. If customer_email_classification currently uses Claude Sonnet with GPT-4o as fallback, switching either model is a config change, not a code change.

Building the routing layer: a minimal proxy in Python

Here's a stripped-down version of the pattern. This isn't production-ready — it's the skeleton to understand the shape.

import time
from typing import Callable
from dataclasses import dataclass

@dataclass
class ModelConfig:
    provider: str
    model_id: str
    timeout_s: int
    call: Callable

class LLMProxy:
    def __init__(self, routing_table: dict[str, list[ModelConfig]]):
        # routing_table maps task names to ordered fallback chains
        self.routes = routing_table

    def complete(self, task: str, prompt: str, max_tokens: int = 500):
        chain = self.routes.get(task)
        if not chain:
            raise ValueError(f"No route for task '{task}'")

        last_error = None
        for model in chain:
            try:
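                # monotonic() is safe for durations; wall-clock time can jump
                start = time.monotonic()
                # Pass model_id so one adapter (e.g. call_claude) can serve several models
                result = model.call(model.model_id, prompt, max_tokens, model.timeout_s)
                self._log(task, model, time.monotonic() - start, success=True)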
                return result
            except Exception as e:
                last_error = e
                self._log(task, model, None, success=False, error=str(e))
                continue

        # Chain the last provider error so the traceback shows the root cause
        raise RuntimeError(f"All models failed for {task}: {last_error}") from last_error

    def _log(self, task, model, latency, success, error=None):
        # Ship to your logging/metrics pipeline
        print({
            "task": task,
            "provider": model.provider,
            "model": model.model_id,
            "latency_s": latency,
            "success": success,
            "error": error,
        })

And the routing config:

routing = {
    "customer_email_classification": [
        ModelConfig("anthropic", "claude-sonnet-4", 10, call_claude),
        ModelConfig("openai", "gpt-4o", 10, call_openai),
        ModelConfig("local", "llama-3.1-70b", 30, call_ollama),
    ],
    "long_document_summary": [
        ModelConfig("anthropic", "claude-opus-4", 60, call_claude),
        ModelConfig("openai", "gpt-4-turbo", 60, call_openai),
    ],
}

proxy = LLMProxy(routing)

When Claude Sonnet 4 gets deprecated, you change one line in the routing table. When a provider starts returning 529 overloaded errors at 3pm every day, the second model in the chain catches the requests automatically. Your application code doesn't know and doesn't care.
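One way to make that one-line change before a deprecation bites is to check the routing table against known sunset dates. The sketch below assumes you maintain the dates by hand from provider notices; `SUNSET_DATES`, `models_near_sunset`, and the model name are hypothetical, and the routes are simplified to lists of model IDs.

```python
from datetime import date

# Hypothetical sunset dates, copied by hand from provider deprecation notices.
SUNSET_DATES = {
    "example-model-v1": date(2026, 9, 1),
}

def models_near_sunset(routes, sunset_dates, today, warn_days=60):
    """Return (task, model_id, days_left) for routed models retiring within warn_days.

    routes maps a task name to an ordered list of model IDs.
    """
    flagged = []
    for task, model_ids in routes.items():
        for model_id in model_ids:
            sunset = sunset_dates.get(model_id)
            if sunset is None:
                continue  # no known retirement date for this model
            days_left = (sunset - today).days
            if days_left <= warn_days:
                flagged.append((task, model_id, days_left))
    return flagged
```

Run from a daily scheduled job, a non-empty result can open a ticket or post to a chat channel.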

What "failover" really has to handle

Naive failover — "if request fails, try the next model" — is not enough. Real production failover handles at least these cases:

Failure mode	Naive handling	Proper handling
5xx from provider	Retry once on same model	Fast-fail after 1 retry, switch provider
Rate limit (429)	Retry with backoff	Switch provider immediately, don't wait
Timeout	Retry	Switch provider, log the slow one
Bad JSON / schema violation	Crash	Reroute to a model better at structured output
Quality degradation	Invisible	Periodic eval suite catches regressions
Model deprecated	Production breaks	Routing table flagged before sunset date
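The first four rows of the table can be sketched as a small decision function plus a loop. This is an illustration under assumptions, not a production policy: `ProviderError`, `Action`, and `complete_with_policy` are hypothetical names, and each adapter is assumed to raise `ProviderError` with a normalized `kind` and HTTP `status`.

```python
from enum import Enum

class Action(Enum):
    RETRY_SAME = "retry_same"
    NEXT_MODEL = "next_model"

class ProviderError(Exception):
    """Hypothetical normalized error raised by a provider adapter."""
    def __init__(self, kind, status=None):
        super().__init__(f"{kind} (status={status})")
        self.kind = kind      # "http", "timeout", or "bad_output"
        self.status = status  # HTTP status code, if any

def next_action(error, attempts_on_model):
    """Decide what to do after a failed call, following the table above."""
    if error.kind == "http" and error.status is not None and error.status >= 500:
        # 5xx: allow a single retry on the same model, then move on.
        return Action.RETRY_SAME if attempts_on_model < 2 else Action.NEXT_MODEL
    # 429s, timeouts, and schema violations: switch without waiting.
    return Action.NEXT_MODEL

def complete_with_policy(chain, prompt):
    """Try each callable in chain (ordered by preference); return the first success."""
    last_error = None
    for call in chain:
        attempts = 0
        while True:
            attempts += 1
            try:
                return call(prompt)
            except ProviderError as err:
                last_error = err
                if next_action(err, attempts) is Action.NEXT_MODEL:
                    break
    raise RuntimeError("All models failed") from last_error
```

Keeping the decision in one pure function makes the policy easy to unit-test without touching any provider.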

The quality dimension is the hard one. A provider can return a 200 OK with technically valid output that is subtly worse than what you got last week. Shopify and similar shops handle this with offline eval suites: a fixed set of representative prompts is rerun against every model in the chain on a schedule, and outputs are scored (either by another model or by humans on edge cases). When a model's score drops below threshold, it gets demoted in the routing table before users notice.

You don't need this on day one. You do need a way to swap models in 5 minutes when something breaks.
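When you do get to the eval suite, the scoring-and-demotion step can be quite small. The sketch below assumes each model is a callable from prompt to output text and that `score_fn` is whatever judge you choose (another model, a rubric, a human label); `score_models` and `reorder_chain` are hypothetical names.

```python
from statistics import mean

def score_models(models, prompts, score_fn):
    """Return {model_name: mean score} over a fixed prompt set.

    models maps a name to a callable(prompt) -> output text.
    score_fn(prompt, output) -> float in [0, 1] is a hypothetical judge.
    """
    return {
        name: mean(score_fn(p, call(p)) for p in prompts)
        for name, call in models.items()
    }

def reorder_chain(chain, scores, threshold):
    """Move models scoring below threshold to the back, keeping relative order."""
    healthy = [m for m in chain if scores[m] >= threshold]
    demoted = [m for m in chain if scores[m] < threshold]
    return healthy + demoted
```

Demoting rather than deleting keeps the weaker model available as a last-resort fallback.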

Normalizing requests and responses across providers

Anthropic's API and OpenAI's API are similar but not identical. System prompts go in different places. Tool-calling schemas differ. Streaming chunk formats differ. Token usage is reported differently. If your app code handles any of this directly, the proxy isn't actually decoupling anything.

The fix is a normalization layer inside the proxy. Define one internal request shape:

{
  "task": "extract_invoice_fields",
  "system": "You extract structured data from invoices.",
  "messages": [
    {"role": "user", "content": "..."}
  ],
  "tools": [
    {
      "name": "save_invoice",
      "schema": { "...": "..." }
    }
  ],
  "max_tokens": 1000,
  "temperature": 0
}

And one internal response shape:

{
  "provider": "anthropic",
  "model": "claude-sonnet-4",
  "text": "...",
  "tool_calls": [],
  "usage": {
    "input_tokens": 1240,
    "output_tokens": 312
  },
  "latency_ms": 1842
}

Inside the proxy, you have adapters per provider that translate to/from these shapes. This is unglamorous code, but it's the work that buys you the freedom to swap providers without touching application logic. The open-source LiteLLM library and the hosted OpenRouter gateway already do most of this if you don't want to write adapters yourself. Both are worth evaluating before you build your own.
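To make the adapter idea concrete, here is a minimal sketch that translates the internal request shape above into provider-style payloads and normalizes token usage on the way back. The field names reflect my understanding of Anthropic's Messages API and OpenAI's Chat Completions API at the time of writing, so check the current references before relying on them; the function names are hypothetical.

```python
def to_anthropic(req, model):
    """Internal request -> Anthropic Messages-style payload (system is top-level)."""
    payload = {
        "model": model,
        "max_tokens": req["max_tokens"],
        "temperature": req.get("temperature", 0),
        "messages": req["messages"],
    }
    if req.get("system"):
        payload["system"] = req["system"]
    if req.get("tools"):
        payload["tools"] = [
            {"name": t["name"], "input_schema": t["schema"]} for t in req["tools"]
        ]
    return payload

def to_openai(req, model):
    """Internal request -> OpenAI Chat Completions-style payload (system is a message)."""
    messages = list(req["messages"])  # copy so the caller's list is not mutated
    if req.get("system"):
        messages.insert(0, {"role": "system", "content": req["system"]})
    payload = {
        "model": model,
        "max_tokens": req["max_tokens"],
        "temperature": req.get("temperature", 0),
        "messages": messages,
    }
    if req.get("tools"):
        payload["tools"] = [
            {"type": "function",
             "function": {"name": t["name"], "parameters": t["schema"]}}
            for t in req["tools"]
        ]
    return payload

def normalize_usage(provider, usage):
    """Map provider-specific token counters onto the internal usage shape."""
    if provider == "anthropic":
        return {"input_tokens": usage["input_tokens"],
                "output_tokens": usage["output_tokens"]}
    if provider == "openai":
        return {"input_tokens": usage["prompt_tokens"],
                "output_tokens": usage["completion_tokens"]}
    raise ValueError(f"No usage adapter for provider '{provider}'")
```

The adapters are pure functions over dictionaries, so they can be tested without network access or API keys.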

Cost and observability: the underrated wins

Once everything routes through one place, two things become trivial that were previously impossible:

Per-team, per-feature cost tracking. Every call carries a task label. Aggregate token usage by task and you know exactly which feature is costing what. Teams often discover that roughly 80% of their LLM spend comes from 2-3 features they didn't expect. On Anthropic's and OpenAI's published pricing pages (always check the current ones, because they change), output tokens can cost around 5x as much as input tokens, so switching one high-volume feature from a premium model to a smaller one can noticeably reduce a monthly bill.
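The aggregation itself is a few lines once the log rows exist. In this sketch the model names and prices are made up for illustration (real prices change, so load them from config), and each log row is assumed to carry `task`, `model`, `input_tokens`, and `output_tokens`.

```python
from collections import defaultdict

# Hypothetical prices in USD per million tokens; not real provider pricing.
PRICES = {
    "model-large": {"input": 3.00, "output": 15.00},
    "model-small": {"input": 0.25, "output": 1.25},
}

def cost_by_task(log_rows, prices):
    """Sum estimated spend per task from proxy log rows."""
    totals = defaultdict(float)
    for row in log_rows:
        price = prices[row["model"]]
        totals[row["task"]] += (
            row["input_tokens"] * price["input"]
            + row["output_tokens"] * price["output"]
        ) / 1_000_000
    return dict(totals)
```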

A single audit log. Every prompt, every response, every model used. When a customer reports a bad output from your AI feature three weeks ago, you can find the exact request and response. When legal asks "do we ever send customer PII to OpenAI," you can answer with data instead of a shrug.

This second one matters more for compliance-sensitive industries (healthcare, finance, anything touching personal data under frameworks like HIPAA or GDPR). A proxy gives you one chokepoint for redaction, logging, and policy enforcement instead of N integration points across your codebase.
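As a small illustration of the chokepoint idea, the proxy can scrub obvious identifiers before a prompt is logged or forwarded. The regex below only catches email addresses and is nowhere near a complete PII detector; `redact_emails` is a hypothetical helper.

```python
import re

# Deliberately simple pattern: catches common email shapes, not every valid address.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def redact_emails(text):
    """Replace anything that looks like an email address with a placeholder."""
    return EMAIL_RE.sub("[EMAIL]", text)
```

Because every call passes through the proxy, a function like this needs to be wired in once rather than at each call site.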

Caching: the quietly massive cost lever

Identical prompts get identical-ish responses. A proxy is the natural place to cache. Two layers worth implementing:

Exact-match cache. Hash the full normalized request (system, messages, tools, max_tokens, temperature), since changing any of these can change the output. If you've seen the same hash in the last N hours, return the cached response. Useful for deterministic prompts at temperature 0.
Semantic cache. Embed the user prompt, look up similar prior prompts, and reuse the response if similarity is above a threshold. Useful for FAQ-style use cases where users ask the same thing five ways.
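A minimal sketch of the exact-match layer, assuming requests are JSON-serializable dictionaries in the internal shape. `ExactMatchCache` is a hypothetical in-memory class; a real deployment would more likely use Redis or a similar shared store.

```python
import hashlib
import json
import time

class ExactMatchCache:
    """In-memory exact-match cache keyed by a hash of the whole request."""

    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock  # injectable so tests can control time
        self._store = {}

    @staticmethod
    def key(request):
        # sort_keys makes the hash independent of dict key order
        blob = json.dumps(request, sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def get(self, request):
        entry = self._store.get(self.key(request))
        if entry is None:
            return None
        stored_at, response = entry
        if self.clock() - stored_at > self.ttl_s:
            return None  # expired
        return response

    def put(self, request, response):
        self._store[self.key(request)] = (self.clock(), response)
```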

Anthropic also offers prompt caching at the API level: long system prompts and tool definitions can be cached server-side for a fraction of the input token cost. Your proxy should mark the cacheable parts of the request with the cache_control field to take advantage of this. Check Anthropic's current prompt caching documentation for the exact discount and TTL.
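As far as I know, Anthropic's Messages API marks a cacheable block with a `cache_control` field on the content block rather than with an HTTP header. The sketch below builds such a payload; `with_prompt_cache` is a hypothetical helper, and the exact field names should be confirmed against the current documentation.

```python
def with_prompt_cache(system_text, messages, model, max_tokens):
    """Build a Messages-style payload whose system prompt is marked cacheable."""
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # Assumption: "ephemeral" is the cache type the API accepts here.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }
```

Only the adapter for that provider needs to know about this field; the internal request shape stays unchanged.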

Semantic caching is risky for anything where wrong answers matter (medical, legal, financial). For low-stakes use cases — customer support FAQ, internal docs Q&A — it can cut LLM spend significantly with minimal quality loss.
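The lookup logic for a semantic cache can be sketched in a few lines. To keep the example self-contained, `toy_embed` uses bag-of-words counts as a stand-in for a real embedding model, which is a loud assumption: swap in actual embeddings before judging hit quality. `SemanticCache` is a hypothetical class with a linear scan, fine for a sketch but not for large caches.

```python
import math
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold, embed=toy_embed):
        self.threshold = threshold
        self.embed = embed
        self.entries = []  # list of (vector, response)

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def get(self, prompt):
        """Return the response of the most similar prior prompt, if similar enough."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the knob that trades cost savings against wrong answers, so it deserves its own eval before going live.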

A practical adoption path for small teams

You are not Shopify. You don't need a dedicated AI platform team. Here's how to get most of the benefit in a week of work:

1. Pick a wrapper. LiteLLM gives you a drop-in OpenAI-compatible interface that proxies to ~100 providers with retry and fallback built in. OpenRouter offers a similar unified API as a hosted service, so there is nothing to run yourself. For most solo operators and small teams, one of these gets you 80% of the value with zero infrastructure.
2. Centralize the calls. Wrap every LLM call in your app behind one internal function. No openai.chat.completions.create(...) scattered across 14 files.
3. Define your tasks. Name each LLM use case (classify_lead, summarize_ticket, draft_reply). Map each to a primary model and one fallback.
4. Log everything. Task name, provider, model, latency, token counts, success/failure. A simple Postgres table works. You'll thank yourself in three months.
5. Set up a deprecation alert. Subscribe to your providers' changelog feeds or RSS. When a model you use shows up on a deprecation list, you have weeks to swap it in the routing table, not hours after production breaks.
6. Eval suite, eventually. Pick 20-50 representative prompts per task. Run them against each model in your chain on a monthly cadence. Score outputs (model-judged is fine to start). When scores drift, investigate.

Steps 1-5 are achievable in a week for a solo developer. Step 6 is what separates teams that get bitten by silent quality regressions from teams that don't.
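The logging step is the easiest one to start with. The sketch below uses SQLite from the standard library as a stand-in for the Postgres table mentioned above; the table name, columns, and helper functions are assumptions chosen to mirror the fields listed in that step.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS llm_calls (
    id INTEGER PRIMARY KEY,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP,
    task TEXT NOT NULL,
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    latency_ms INTEGER,
    input_tokens INTEGER,
    output_tokens INTEGER,
    success INTEGER NOT NULL,
    error TEXT
)
"""

def open_log(path=":memory:"):
    """Open (or create) the call log; the default is an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn

def log_call(conn, task, provider, model, latency_ms,
             input_tokens, output_tokens, success, error=None):
    """Insert one row per LLM call, successful or not."""
    conn.execute(
        "INSERT INTO llm_calls (task, provider, model, latency_ms, "
        "input_tokens, output_tokens, success, error) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (task, provider, model, latency_ms,
         input_tokens, output_tokens, int(success), error),
    )
    conn.commit()

def failure_rate_by_model(conn):
    """Return {model: fraction of failed calls}."""
    rows = conn.execute(
        "SELECT model, AVG(1 - success) FROM llm_calls GROUP BY model ORDER BY model"
    ).fetchall()
    return dict(rows)
```

A query like `failure_rate_by_model` is often the first signal that a fallback is quietly carrying more traffic than expected.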

How BizFlowAI approaches this

This proxy-and-failover pattern is the default for every client agent we build. When we ship a customer-support automation or a lead-qualification agent, the LLM calls go through a routing layer with primary + fallback providers, structured logging, and a config-driven model map. When Anthropic or OpenAI deprecates a model — and they will — the swap is a config change reviewed in minutes, not an incident.

If you're running AI features in production and the thought of your primary provider going down for an afternoon makes you nervous, that's a fixable problem. Book a call and we'll map out where your stack is fragile and what the smallest version of this proxy pattern looks like for your specific workloads.

The takeaway

Treating LLMs as commodity inputs — not load-bearing dependencies — is the cheapest insurance you can buy against a market where models get deprecated, repriced, and rate-limited on someone else's schedule. The proxy pattern Shopify uses is not exotic. It's the same abstraction that made cloud apps survive their cloud providers' worst days, applied to a new dependency that happens to be more unstable than EC2 ever was.

You don't need a platform team to do this. You need one wrapper, one routing table, and a logging table. Build it once. Swap models forever.

Frequently asked questions

What is an LLM proxy and why do I need one?

An LLM proxy is a thin service that sits between your application and model providers like Anthropic, OpenAI, or Google. Your app sends a generic request, and the proxy decides which model to call, handles retries, and normalizes responses. This lets you swap providers via config change instead of code change when models get deprecated or APIs fail. It's the same pattern cloud-native apps use to survive region outages, applied to LLMs.

How does Shopify handle LLM provider failures and deprecations?

Shopify routes every internal AI call through a single proxy layer with predefined fallback chains per task. When a provider returns errors, rate limits, or deprecates a model, the proxy automatically reroutes traffic to the next model in the chain. Application code never changes — only the routing table config. This pattern has been described in their public engineering talks and blog posts.

What's the difference between LiteLLM and OpenRouter?

Both normalize requests and responses across LLM providers like Anthropic, OpenAI, and Google, but they are different kinds of tools. LiteLLM is an open-source Python library (with an optional proxy server) that you run yourself, giving full control over routing and logging. OpenRouter is a hosted gateway service that handles billing and provider access for you. Either saves you from writing provider-specific adapter code from scratch.

How do I handle rate limit errors (429) from LLM providers?

Don't retry with backoff on the same provider — switch providers immediately. A 429 means that specific provider is throttling you, so waiting just delays the request. A proper proxy detects 429s and routes the next call to the fallback model in your chain. Log the throttled provider so you can spot patterns and adjust quotas.

How do I detect LLM quality degradation in production?

Run an offline eval suite: a fixed set of representative prompts rerun against every model in your routing chain on a schedule. Score outputs using another model as judge or human review for edge cases. When a model's score drops below threshold, demote it in the routing table before users notice. Provider APIs can return 200 OK with subtly worse output, so monitoring status codes alone isn't enough.