AI Agents in Production: What Breaks After the Demo

By Lazar Milicevic · Published June 15, 2026 · 12 min read

Developer debugging production AI agent code on laptop with terminal logs and monitoring dashboards visible

Your agent demo worked. The investor clapped, the client signed, the Loom got 200 views. Then Tuesday hit: the OpenAI API returned a 529, the agent retried the same tool call seventeen times, and somewhere a customer got four duplicate refund emails. Welcome to week two.

This post is about the gap between "it works on my laptop with a happy-path prompt" and "it runs unattended for thirty days without paging me at 3am." The failure modes are predictable. The fixes are mostly boring. Both are worth knowing before you ship.

The demo lies, and you let it

A demo is a controlled environment. One input, one happy path, one model call, one tool, one developer watching the terminal. Production is none of those things.

What the demo hides:

Concurrency. Two users hitting the same agent at the same time, sharing or stomping on state.
Partial failure. The agent calls three tools; the second one half-succeeds (row inserted, webhook never fired).
Long tails. P50 latency is 4s, P99 is 90s, and your frontend times out at 30s.
Drift. The model that picked the right tool 95% of the time last month picks it 88% of the time after a silent provider update.
Inputs you didn't imagine. A user pastes a 40,000-token email thread. Another sends an emoji-only message.

The first job of taking an agent to production is to stop trusting the demo. Write down every assumption the demo made and assume the real world will violate each one. Then design for the violation.

State: the silent killer

Most demo agents are stateless from the developer's point of view — they hold context in a Python variable, a list, or the LLM's conversation history. That works until the process restarts, scales horizontally, or two requests interleave.

Three rules for production agent state:

Externalize it. Conversation history, tool-call logs, and intermediate results go in a database (Postgres is fine), not in memory.
Make it idempotent. Every step should have a stable key so that a retry doesn't re-do work.
Make it inspectable. You need to read the state of any in-flight agent run without attaching a debugger.

A minimal schema:

CREATE TABLE agent_runs (
  run_id          UUID PRIMARY KEY,
  user_id         TEXT NOT NULL,
  status          TEXT NOT NULL,  -- pending, running, waiting_tool, done, failed
  current_step    INT NOT NULL DEFAULT 0,
  input_payload   JSONB NOT NULL,
  scratchpad      JSONB NOT NULL DEFAULT '{}',
  created_at      TIMESTAMPTZ NOT NULL DEFAULT now(),
  updated_at      TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE agent_steps (
  run_id        UUID NOT NULL REFERENCES agent_runs(run_id),
  step_index    INT NOT NULL,
  tool_name     TEXT NOT NULL,
  tool_input    JSONB NOT NULL,
  tool_output   JSONB,
  status        TEXT NOT NULL,  -- pending, running, ok, error
  attempt       INT NOT NULL DEFAULT 0,
  idempotency_key TEXT NOT NULL,
  PRIMARY KEY (run_id, step_index)
);

With this, your worker loop becomes: load the run, find the next pending step, execute it, persist the result, repeat. Crash anywhere — the next worker picks up at exactly the right place.

Retries that don't make things worse

Retries are the most common source of week-two pain because the obvious implementation is wrong in subtle ways.

The naive version:

for attempt in range(3):
    try:
        return call_tool(args)
    except Exception:
        time.sleep(1)

Problems with this code, in order of how badly they'll burn you:

No distinction between retryable and non-retryable errors. A 400 (your fault) gets retried three times. A 401 (auth) gets retried three times. A 500 (their fault) gets retried, which is fine.
No backoff. Three calls in three seconds during an outage is how you get rate-limited or banned.
No idempotency. If call_tool is "send email" and the API timed out after sending, you just sent three emails.
No budget. The agent has six tools, each retrying three times. One bad call can cascade into a 90-second user-facing freeze.

A better shape:

RETRYABLE_STATUSES = {429, 500, 502, 503, 504, 529}

def call_with_retries(tool, args, idempotency_key, max_attempts=4):
    for attempt in range(max_attempts):
        try:
            return tool.invoke(args, idempotency_key=idempotency_key)
        except ToolError as e:
            if e.status not in RETRYABLE_STATUSES:
                raise  # don't retry client errors
            if attempt == max_attempts - 1:
                raise
            sleep = min(2 ** attempt, 30) + random.uniform(0, 1)
            time.sleep(sleep)

Two things matter more than the code:

Idempotency keys are the contract. Generate one per logical step (not per attempt) and pass it to every external API that supports it (Stripe, most modern SaaS). For APIs that don't, your wrapper has to deduplicate based on a "I already did this" record before calling.
Set a global budget per run. No single agent run should be allowed to make 200 tool calls because of a retry storm. Track attempts in the agent_steps table and fail loudly when the budget blows.

Partial failures are the actual failure mode

In a happy-path demo, tools either work or throw. In production, the most painful state is the one in between: the API returned 200 but the database write rolled back, or the row got created but the confirmation webhook never fired, or the email was queued but not sent.

A few patterns that help:

Read-after-write checks. When a tool's success matters (creating an invoice, sending a payout), the next step should be "verify it." Don't trust the 200.

Outbox pattern for side effects. Instead of the agent directly emailing a customer, write the email to an outbox table in the same transaction as the state update. A separate worker drains the outbox. This way the agent's "I sent it" and the world's "it was sent" stay in sync, and crashes mid-flight don't lose or duplicate the action.

def send_customer_email(run_id, step_index, to, subject, body):
    with db.transaction():
        db.execute("""
            INSERT INTO outbox (run_id, step_index, kind, payload, status)
            VALUES (%s, %s, 'email', %s, 'pending')
            ON CONFLICT (run_id, step_index) DO NOTHING
        """, (run_id, step_index, json({"to": to, "subject": subject, "body": body})))
        db.execute("""
            UPDATE agent_steps SET status='ok'
            WHERE run_id=%s AND step_index=%s
        """, (run_id, step_index))

The agent thinks the work is done as soon as the row exists. The actual SMTP call happens in a separate, retry-safe worker that won't double-send because of the unique key.

Compensating actions. Some failures can't be retried — they need to be undone. If your agent creates a Shopify order and then fails to charge the card, you need a "cancel order" step, not another retry. Map out which steps need compensation before you ship, not after.

The API that times out on Tuesdays

Every external dependency will fail. The question is what your agent does when it does.

A rough taxonomy of dependency failures and reasonable responses:

Failure type	Example	Reasonable response
Transient (5xx, network)	Provider has a bad hour	Retry with backoff, then queue for later
Rate limit (429)	You hit a per-minute cap	Backoff respecting `Retry-After`, lower concurrency
Auth (401/403)	Token expired or revoked	Refresh once; if still failing, alert a human
Bad input (400)	Malformed request	Do not retry; log payload; fail the step
Timeout	No response in N seconds	Cancel, check idempotency, decide retry vs. verify
Quota / billing	Out of credits	Stop the run, page someone

A few defaults that prevent most of the pain:

Tight per-call timeouts. If a tool normally takes 2s, don't let it hang for 60. Pick a number, enforce it, fail fast.
Circuit breakers. If a downstream API has failed 20 times in the last minute, stop trying for 30 seconds. You're not helping by hammering it.
Bulkheads. Don't share connection pools, retry budgets, or worker pools across unrelated tools. A misbehaving CRM shouldn't take down your email tool.
Asynchronous when possible. If the user doesn't need an immediate answer, run the agent in a background job and notify on completion. This single decision eliminates most timeout-related complaints.

Observability: if you can't see it, you can't fix it

The single biggest difference between teams that successfully run agents in production and teams that don't is logging discipline. Not "I have some logs." Structured, queryable, run-scoped logs that let you replay any conversation in under a minute.

Minimum bar:

Every LLM call logged with: run_id, step_index, model, prompt tokens, completion tokens, latency, full prompt, full response, tool calls extracted.
Every tool call logged with: run_id, step_index, tool, input, output, status, attempt, latency, idempotency key.
Every state transition logged with: run_id, old status, new status, reason.

log.info("llm_call", extra={
    "run_id": run_id,
    "step": step_index,
    "model": "gpt-4o-mini",
    "prompt_tokens": usage.prompt_tokens,
    "completion_tokens": usage.completion_tokens,
    "latency_ms": elapsed_ms,
    "tool_calls": [tc.name for tc in response.tool_calls],
})

Send this to anywhere you can query it — Datadog, Honeycomb, a Postgres table, doesn't matter. What matters is that when a customer says "the agent did something weird yesterday at 4pm," you can find the exact run and read it.

On top of structured logs, add three categories of metrics:

Run-level. Success rate, P50/P95/P99 latency, average tool calls per run, average cost per run.
Tool-level. Per-tool success rate, latency, retry count.
Model-level. Token usage by model, cost by model, refusal/error rate.

A 10% drop in your "agent successfully completed task" rate is a leading indicator of every kind of breakage — model drift, API changes, new edge cases in input. You can't see that drop without the metric.

Evals: the only thing that stops silent regression

You changed the prompt to fix one bug. Did you break three others? Without evals, you have no idea, and neither does the customer until they hit one of the new bugs.

A pragmatic eval setup for a small team:

A folder of 30-100 real example inputs, drawn from logs, with expected behavior described.
A script that runs the current agent against all of them and produces a pass/fail (or graded) result.
Run it on every prompt change, model change, and tool change. Run it nightly against production traffic samples to catch drift.

# evals/cases/refund_eligible.yaml
input:
  customer_email: "I want a refund for order 4421"
  customer_id: "cus_8821"
expectations:
  - tool_called: lookup_order
    with: {order_id: "4421"}
  - tool_called: check_refund_policy
  - final_response_contains: "refund has been initiated"
  - no_tool_called: charge_card

You don't need a framework. A YAML file, a Python runner, and a CI job get you 80% of the value. The hard part is the discipline to add a new case every time something breaks in production. Do that and your agent gets monotonically more reliable; skip it and you'll be debugging the same regressions forever.

Cost and latency budgets you actually enforce

A demo doesn't care if a single run costs $0.40 and takes 45 seconds. Production does, especially when traffic grows.

Set budgets explicitly:

Per-run cost ceiling. If a single run is projected to exceed N cents, stop and fall back (human handoff, simpler model, error).
Per-run step ceiling. Most agents that loop forever do so because of a missing exit condition. Cap the number of tool calls hard.
Per-run latency ceiling. Set a wall-clock deadline. When it's exceeded, return a partial result or a graceful failure, not a hang.

Cheap defenses that pay off:

Cache deterministic tool outputs (a product lookup by ID) inside a run.
Use a smaller model for routing and a larger model only for the steps that need it.
Truncate or summarize long context aggressively before sending to the LLM; most agents don't need the full thread, they need the last decision plus a summary.

How BizFlowAI approaches this

The patterns above — externalized state, idempotency keys, outbox tables, structured run logs, eval files — aren't theoretical. They're what we ended up building after debugging enough week-two agent failures for clients who had shipped something a vendor or a freelancer left as a happy-path script. Most of those projects didn't need to be rewritten; they needed the retry logic, the observability, and the eval harness that should have been there from day one.

When we build an automation for a solopreneur or a small ops team, we design for the second week first. That means assuming the API will time out, the model will drift, the user will paste something weird, and the worker will get killed mid-step — and making sure none of those break the system. That's mostly what a discovery call with us is: walking through what your agent actually has to do and pointing out where it will quietly fall over before it ships.

The short version

If you only remember five things from this post:

Externalize state so any run is recoverable and inspectable.
Use idempotency keys on every step with side effects, not just retries.
Distinguish error classes before deciding to retry; give every run a hard budget.
Log every LLM and tool call with enough detail to replay it.
Write evals and run them on every change.

The demo is the easy part. The reason agents fail in production is rarely the model — it's the plumbing around the model. Get the plumbing right and you have a system you can leave running. Skip it and you have a pager.

Frequently asked questions

Why do AI agents work in demos but fail in production?

Demos run a single happy-path input with one developer watching, while production introduces concurrency, partial failures, long-tail latency, model drift, and unexpected inputs. The agent's state lives in memory and breaks on restart or scale-out. External APIs return 429s, 5xx errors, and timeouts that the demo never hit. The fix is to assume every demo assumption will be violated and design state, retries, and failure handling accordingly.

How should I store state for an AI agent in production?

Externalize agent state to a database like Postgres rather than holding it in Python variables or LLM conversation history. Use two tables: one for agent runs (status, scratchpad, input) and one for individual steps (tool name, input, output, attempt count, idempotency key). This makes runs resumable after crashes, inspectable without a debugger, and safe for horizontal scaling. The worker loop just loads the run, executes the next pending step, and persists the result.

What is the right way to implement retries for LLM tool calls?

Only retry on transient errors (429, 500, 502, 503, 504, 529) and never on 400 or 401. Use exponential backoff with jitter, cap the maximum delay, and enforce a global retry budget per agent run so one bad tool can't cascade into 200 calls. Always pass an idempotency key generated per logical step (not per attempt) so retries don't send duplicate emails or charges. For APIs without native idempotency, deduplicate in your wrapper using a record of completed work.

What is the outbox pattern and why use it for AI agents?

The outbox pattern means writing side effects (emails, webhooks, payments) as rows in an outbox table inside the same transaction that updates agent state, instead of calling the external API directly. A separate worker drains the outbox and performs the real action, retrying safely because of a unique key. This prevents the common production bug where an agent crashes after sending an email but before recording it, or records success when the email never went out. It keeps the agent's view of reality consistent with what actually happened.

How should an AI agent handle external API failures?

Match the response to the failure type: retry with backoff on 5xx and network errors, respect Retry-After on 429, refresh tokens once on 401, never retry 400s, and stop the run entirely on quota or billing errors. Set tight per-call timeouts (a 2-second tool shouldn't hang 60 seconds), add circuit breakers that pause calls after repeated failures, and use bulkheads so a broken CRM doesn't take down email. When users don't need an immediate response, run the agent as a background job to eliminate timeout pressure.

Work with BizFlowAI

If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.

Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.