MCP Servers in Production: Auth, Timeouts, Retries

By Lazar Milicevic · Published June 14, 2026 · 11 min read


Your MCP server works on your laptop. It lists deals from HubSpot, drafts a follow-up, and the demo lands. Then you point it at a real client's CRM with 40,000 contacts, a rotating OAuth token, and an SLA, and the wheels come off in week one.

This post is about what happens after the tutorial ends. Specifically: how to make an MCP server survive auth expiry, slow upstreams, partial failures, and the kind of traffic that comes from agents that don't sleep. If you're building toward something a business will actually depend on, this is the part nobody writes about.

What actually changes when MCP goes to production

A demo MCP server has one user (you), one tool call at a time, fresh tokens, and a forgiving network. Production has none of that.

The shift looks like this:

Concern	Demo	Production
Auth	Personal token pasted in `.env`	OAuth refresh, per-tenant credentials, rotation
Concurrency	One agent, one call	Multiple agents, parallel tool calls, rate limits
Failures	"Try again"	Idempotency keys, retries, dead-letter queue
Timeouts	Default (often unbounded)	Explicit budgets per call, per tool, per session
Observability	`print()`	Structured logs, trace IDs, per-tool metrics
Upstream	One CRM, your account	N tenants, each with quirks and quotas

The MCP spec itself solves very little of this. It defines transports (stdio and streamable HTTP, plus the older HTTP/SSE transport), a tool-calling contract, and an OAuth-based authorization flow between caller and server on HTTP transports. Everything else above the wire is your problem. That's not a complaint; it's the right scope for a protocol. But it means the production work lives almost entirely in your server implementation.

Auth: stop pasting tokens, start refreshing them

The single biggest gap between demo and production MCP servers is auth. Most tutorials hardcode a personal access token. That works until the token expires, the user rotates it, or you onboard a second customer.

For real deployments you usually need three layers:

Caller → MCP server. The agent (Claude, your own runtime, an internal app) needs to authenticate to the MCP server. For HTTP transports, this is typically OAuth 2.1 with PKCE, or a signed bearer token issued by your control plane.
MCP server → upstream API. The CRM or ERP needs its own credentials, usually OAuth with refresh tokens, sometimes per-tenant API keys.
Tenant isolation. If you serve multiple customers, every tool call must resolve to the right tenant credentials and never leak across.

Here's the shape of a refresh-aware credential store, simplified:

import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class TokenBundle:
    access_token: str
    refresh_token: str
    expires_at: float  # unix seconds

class CredentialStore:
    def __init__(self, kv, oauth_client, skew_seconds: int = 60):
        self.kv = kv                  # encrypted KV (e.g. KMS-backed)
        self.oauth = oauth_client
        self.skew = skew_seconds

    async def get_access_token(self, tenant_id: str) -> str:
        bundle: Optional[TokenBundle] = await self.kv.get(tenant_id)
        if not bundle:
            raise PermissionError(f"no credentials for tenant {tenant_id}")

        if bundle.expires_at - self.skew > time.time():
            return bundle.access_token

        # Refresh under a per-tenant lock so we don't burn refresh tokens
        async with self.kv.lock(f"refresh:{tenant_id}"):
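            bundle = await self.kv.get(tenant_id)  # re-read inside lock
            if not bundle:
                # Credentials were removed while we waited for the lock
                raise PermissionError(f"no credentials for tenant {tenant_id}")
            if bundle.expires_at - self.skew > time.time():
                return bundle.access_token  # another caller already refreshed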

            new_bundle = await self.oauth.refresh(bundle.refresh_token)
            await self.kv.put(tenant_id, new_bundle)
            return new_bundle.access_token

Three things to notice:

Skew. Refresh slightly before expiry. Clocks drift, and a token that's "valid for 60 more seconds" is a coin flip in practice.
Locking. Without a per-tenant lock, a burst of parallel tool calls will all see an expired token and all try to refresh. Some upstreams invalidate previous refresh tokens on use — you can lock yourself out.
Storage. Tokens go in encrypted storage with KMS, not in environment variables, and never in logs.

If you're serving more than one customer, also enforce tenant resolution at the transport layer, not inside individual tools. The tool function should receive a typed Context with the tenant already resolved. If a tool ever has to ask "which tenant am I?", you've already lost.
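
One way to enforce this, sketched under assumptions: the transport layer has already verified the caller's token and hands its claims over as a dict. The `tenant_id` claim name, the `Context` fields, and `resolve_context` are hypothetical names for illustration, not part of the MCP spec or any SDK.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Context:
    """Per-call context built by the transport layer, never by a tool."""
    tenant_id: str
    trace_id: str


def resolve_context(claims: dict) -> Context:
    # `claims` is assumed to come from an already-verified caller token.
    tenant_id = claims.get("tenant_id")
    if not isinstance(tenant_id, str) or not tenant_id:
        # Fail closed: no tenant means no tool call, not a default tenant.
        raise PermissionError("caller token carries no tenant")
    return Context(tenant_id=tenant_id, trace_id=uuid.uuid4().hex)
```

Because the dataclass is frozen, a tool cannot reassign `ctx.tenant_id` halfway through a call, which removes one class of cross-tenant bugs.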

Timeout budgets, not timeout values

The mistake I see most often: developers set a single timeout, say 30 seconds, on the upstream HTTP client and call it done. Then the upstream answers in 31 seconds and the call fails, even though the agent layer would have waited 60. Or three sequential upstream calls each take 25 seconds, all within the limit, so the tool returns after 75 seconds and the agent has long since given up.

You need a budget, not a value. The budget is allocated at the top of a tool call and consumed by every downstream operation.

import asyncio
import time
from contextvars import ContextVar

_deadline: ContextVar[float] = ContextVar("_deadline")

class Deadline:
    def __init__(self, total_seconds: float):
        self.deadline = time.monotonic() + total_seconds

    def remaining(self) -> float:
        return max(0.0, self.deadline - time.monotonic())

    def __enter__(self):
        self._token = _deadline.set(self.deadline)
        return self

    def __exit__(self, *_):
        _deadline.reset(self._token)

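def remaining_budget(reserve: float = 0.0) -> float:
    # Seconds left in the active Deadline, minus a reserve for cleanup.
    # Raises LookupError if called outside a `with Deadline(...)` block.
    d = _deadline.get()
    return max(0.0, d - time.monotonic() - reserve)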

async def http_get(client, url, **kwargs):
    # Always pass a timeout derived from the remaining budget,
    # reserving 250ms so we can return a clean error if we run out.
    timeout = remaining_budget(reserve=0.25)
    if timeout <= 0:
        raise TimeoutError("budget exhausted before request")
    return await client.get(url, timeout=timeout, **kwargs)

A reasonable default hierarchy for an MCP tool that hits one upstream:

Tool budget: the time the agent will wait. Pick something the model can actually use — usually 10–30 seconds for interactive work.
Per-upstream-call timeout: derived from remaining budget, capped at something sane (e.g. 8 seconds for a single REST call).
Reserve: 200–500 ms reserved at the end to return a structured error instead of dying mid-write.
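
The hierarchy above reduces to one small calculation. This is a sketch only: `call_timeout` is a hypothetical helper, and the 8-second cap and 250 ms reserve are the example values from the list, not recommendations for every upstream.

```python
def call_timeout(remaining: float, cap: float = 8.0, reserve: float = 0.25) -> float:
    """Timeout for one upstream call, in seconds.

    Takes whatever is left of the tool budget, holds back the reserve,
    and never exceeds the per-call cap.
    """
    return max(0.0, min(cap, remaining - reserve))
```

With 20 seconds of budget left the call gets the full 8-second cap; with 3 seconds left it gets 2.75; with 100 ms left it gets 0, which the caller should treat as "budget exhausted" rather than issuing the request.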

If a tool needs more than ~30 seconds, it's no longer a synchronous tool. Make it an async job: return a job ID, expose a get_job_status tool, let the agent poll or subscribe. Long-running synchronous tools wreck agent loops and burn token budgets.
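
A minimal sketch of that job pattern, assuming a single-process server and an in-memory registry. `start_job` and `_jobs` are hypothetical names; a real deployment would persist job state so it survives restarts.

```python
import asyncio
import uuid

_jobs: dict[str, asyncio.Task] = {}  # in-memory only; lost on restart


async def start_job(coro) -> dict:
    """Kick off long-running work and return immediately with a job ID."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = asyncio.create_task(coro)
    return {"job_id": job_id, "status": "running"}


async def get_job_status(job_id: str) -> dict:
    """What a get_job_status tool could return to a polling agent."""
    task = _jobs.get(job_id)
    if task is None:
        return {"job_id": job_id, "status": "unknown"}
    if not task.done():
        return {"job_id": job_id, "status": "running"}
    if task.cancelled():
        return {"job_id": job_id, "status": "cancelled"}
    if task.exception() is not None:
        return {"job_id": job_id, "status": "failed", "error": str(task.exception())}
    return {"job_id": job_id, "status": "done", "result": task.result()}
```

The tool that starts the work returns in milliseconds, so the agent's loop stays responsive and each poll is a cheap, read-only call that is safe to retry.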

Retries: idempotency first, backoff second

Retries without idempotency are how duplicate invoices get created. Always design the contract before the retry policy.

Classify every tool by side effect:

Class	Examples	Retry safely?
Read-only	`search_contacts`, `get_deal`	Yes, freely
Idempotent write	`upsert_contact(external_id=...)`	Yes, with idempotency key
Non-idempotent write	`create_invoice`, `send_email`	Only with idempotency key the upstream honors
Side-effecty external	Stripe charge, webhook fire	Use upstream's idempotency mechanism

For non-idempotent operations, never retry blindly. Either:

Use the upstream's idempotency key header (Stripe, many modern APIs).
Do a read-before-write to check if a previous attempt succeeded.
Wrap the operation in a local outbox table and let a background worker handle retries with deduplication.
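
To make the first and third options concrete, here is a sketch of a deterministic idempotency key plus a deduplicating outbox. Both `idempotency_key` and `Outbox` are hypothetical, and the outbox is an in-memory stand-in for a table with a unique constraint on the key column.

```python
import hashlib
import json


def idempotency_key(tenant_id: str, tool: str, payload: dict) -> str:
    """Same tenant + tool + payload always yields the same key."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(f"{tenant_id}|{tool}|{canonical}".encode()).hexdigest()
    return digest[:32]


class Outbox:
    """In-memory stand-in for an outbox table with a unique key column."""

    def __init__(self):
        self._rows: dict[str, dict] = {}

    def enqueue(self, key: str, operation: dict) -> bool:
        # Returns False when the key is already queued, so a retried
        # tool call does not create a second pending write.
        if key in self._rows:
            return False
        self._rows[key] = operation
        return True
```

One trade-off to be aware of: a purely content-derived key also collapses two intentional, identical operations into one. Mixing in a per-request value such as a trace ID avoids that, at the cost of only deduplicating retries of the same request.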

The retry policy itself is the easy part:

import asyncio, random

RETRYABLE_STATUSES = {408, 425, 429, 500, 502, 503, 504}

async def with_retry(fn, *, max_attempts=4, base=0.25, cap=4.0):
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return await fn()
        except RetryableError as e:
            last_exc = e
            if attempt == max_attempts - 1:
                break
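            budget = remaining_budget(reserve=0.25)
            if budget <= 0:
                break  # no time left to retry
            if e.retry_after is not None:
                # Honor server-supplied Retry-After. Retrying sooner than
                # asked just earns another rejection, so give up when the
                # wait doesn't fit in the budget.
                delay = e.retry_after
                if delay > budget:
                    break
            else:
                # Exponential backoff with full jitter, never past the deadline
                delay = min(budget, random.uniform(0, min(cap, base * (2 ** attempt))))
            # A zero delay (e.g. retry_after=0 after a 401) retries immediately
            await asyncio.sleep(delay)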
    raise last_exc
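
The policy above relies on a `RetryableError` that carries a `retry_after` value. The article does not define it, so the following is one possible shape, not the author's implementation. It assumes the response object exposes `.status_code` and a dict-like `.headers`, as `httpx` and `requests` responses do; `parse_retry_after` is a hypothetical helper.

```python
import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str]) -> Optional[float]:
    """Retry-After is either delay-seconds or an HTTP-date; return seconds."""
    if not value:
        return None
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: fall back to jittered backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=datetime.timezone.utc)
    now = datetime.datetime.now(datetime.timezone.utc)
    return max(0.0, (when - now).total_seconds())


class RetryableError(Exception):
    def __init__(self, message: str, retry_after: Optional[float] = None):
        super().__init__(message)
        self.retry_after = retry_after

    @classmethod
    def from_response(cls, resp) -> "RetryableError":
        # Assumes `resp` has .status_code and a dict-like .headers.
        return cls(
            f"upstream returned {resp.status_code}",
            retry_after=parse_retry_after(resp.headers.get("Retry-After")),
        )
```

Returning `None` for a missing or malformed header lets the retry loop fall through to its jittered backoff instead of failing on bad input from the upstream.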

Notes from running this in anger:

Full jitter beats fixed backoff. It de-correlates clients during incidents.
Honor Retry-After. Salesforce, HubSpot, Stripe, GitHub all send it. Ignore it and you'll get throttled harder.
Cap attempts low. 3–4 attempts is usually right. Retrying 10 times on a struggling upstream just extends the outage.
Don't retry across the budget. If you have 800 ms left, don't sleep 2 seconds.
Circuit-break per upstream per tenant. One tenant's broken integration shouldn't poison everyone's tool calls.
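
The last note, circuit-breaking per upstream per tenant, can be sketched as a small state holder keyed by the pair. `CircuitBreaker`, its threshold, and its cooldown are illustrative assumptions; production breakers usually add a stricter half-open state that admits a single probe.

```python
import time


class CircuitBreaker:
    """Opens after `threshold` consecutive failures for a key, then
    lets calls through again once `cooldown` seconds have passed."""

    def __init__(self, threshold: int = 5, cooldown: float = 30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self._failures: dict[tuple, int] = {}
        self._opened_at: dict[tuple, float] = {}

    def allow(self, upstream: str, tenant_id: str) -> bool:
        opened = self._opened_at.get((upstream, tenant_id))
        if opened is None:
            return True
        # After the cooldown, calls are allowed again; one more failure
        # re-opens the circuit because the failure count was not reset.
        return self.clock() - opened >= self.cooldown

    def record_success(self, upstream: str, tenant_id: str) -> None:
        key = (upstream, tenant_id)
        self._failures.pop(key, None)
        self._opened_at.pop(key, None)

    def record_failure(self, upstream: str, tenant_id: str) -> None:
        key = (upstream, tenant_id)
        self._failures[key] = self._failures.get(key, 0) + 1
        if self._failures[key] >= self.threshold:
            self._opened_at[key] = self.clock()
```

Keying on `(upstream, tenant_id)` is the point: a revoked token at one tenant trips only that tenant's breaker, and everyone else keeps calling the same upstream normally.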

Rate limits and per-tenant fairness

Once you have more than one tenant, a noisy one will eat your shared quota with the upstream API. CRMs and ERPs almost all have org-wide quotas that you can blow through faster than the customer expects.

Two patterns that work:

1. Token bucket per tenant per upstream. Sized to the upstream's documented limit, with a small shared overflow. This stops one tenant from starving the rest.

2. Concurrency caps before request caps. Most upstream incidents I've seen weren't "too many requests per second" — they were "too many concurrent long-running requests." A simple semaphore per tenant (e.g. 8 in-flight) prevents the kind of pileup that turns a slow upstream into a dead one.
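
Both patterns fit in a few lines. This sketch assumes a single-process server. `TokenBucket` and `bucket_for` are hypothetical, the default rate and capacity are placeholders rather than any upstream's real limits, and the shared overflow pool is left out for brevity.

```python
import asyncio
import time


class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


_buckets: dict[tuple, TokenBucket] = {}
_semaphores: dict[str, asyncio.Semaphore] = {}


def bucket_for(tenant_id: str, upstream: str,
               rate: float = 5.0, capacity: float = 10.0) -> TokenBucket:
    # One bucket per (tenant, upstream) pair.
    return _buckets.setdefault((tenant_id, upstream), TokenBucket(rate, capacity))


def tenant_semaphore(tenant_id: str, limit: int = 8) -> asyncio.Semaphore:
    # The limit only takes effect when a tenant's semaphore is first created.
    return _semaphores.setdefault(tenant_id, asyncio.Semaphore(limit))
```

A tool would check `bucket_for(tenant, "hubspot").try_acquire()` before issuing a request and wrap the request itself in `async with tenant_semaphore(tenant):`, so the request rate and the number of in-flight calls are bounded independently.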

For 429 responses specifically: back off, respect Retry-After, and surface the rate-limit state to the agent. A tool result that says "rate_limited": true, "retry_after_seconds": 12 is far more useful than a generic error — a well-built agent can decide to do other work in the meantime.
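
A sketch of building that result, assuming the server has already decided the wait does not fit in its budget. `rate_limited_result` and the `ok` field are illustrative; only `rate_limited` and `retry_after_seconds` come from the example above.

```python
import math
from typing import Optional


def rate_limited_result(retry_after: Optional[float]) -> dict:
    """Tool result for a 429 the server could not absorb within its budget."""
    result = {"ok": False, "rate_limited": True}
    if retry_after is not None:
        # Round up so the agent never retries a fraction of a second early.
        result["retry_after_seconds"] = math.ceil(retry_after)
    return result
```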

Observability you can debug at 2am

The agent calls a tool. The tool fails. The user sees "I couldn't update that deal." Without the right logs, you have nothing.

Minimum you should be emitting on every tool call:

{
  "ts": "2024-...",
  "trace_id": "01HV...",
  "session_id": "mcp-sess-...",
  "tenant_id": "acme",
  "tool": "crm.update_deal",
  "duration_ms": 842,
  "outcome": "ok",
  "upstream_calls": 2,
  "upstream_retries": 1,
  "budget_ms_total": 15000,
  "budget_ms_used": 842,
  "auth_refresh": false
}

Add structured error fields when things fail: error class, upstream status code, whether it was retryable, whether the deadline was the cause. Trace IDs should propagate through to upstream requests as headers when possible, so you can join your logs against the CRM's own audit trail.
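
A minimal emitter for records like this could look as follows. It is a sketch, not the author's logger: the `log_tool_call` signature is inferred from how the article calls it elsewhere, and the `error` sub-fields shown in the comment are assumed names.

```python
import json
import sys
import time
from typing import Optional


def log_tool_call(*, tool: str, tenant: str, trace: str, outcome: str,
                  upstream_calls: int = 0, duration_ms: Optional[int] = None,
                  error: Optional[dict] = None, stream=sys.stdout) -> None:
    """Write one JSON object per tool call, one line each."""
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "trace_id": trace,
        "tenant_id": tenant,
        "tool": tool,
        "outcome": outcome,
        "upstream_calls": upstream_calls,
    }
    if duration_ms is not None:
        record["duration_ms"] = duration_ms
    if error is not None:
        # e.g. {"class": "RetryableError", "upstream_status": 503,
        #       "retryable": True, "deadline_exceeded": False}
        record["error"] = error
    stream.write(json.dumps(record) + "\n")
```

One JSON object per line keeps the output greppable and lets most log pipelines ingest it without a custom parser. Never pass token values into any of these fields.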

Metrics worth alerting on:

p95 / p99 latency per tool (not just overall — averages lie).
Error rate per tool per tenant.
Auth refresh failures (this is almost always a leading indicator of a broken integration).
Deadline-exhausted rate. If this climbs, your budgets are wrong or an upstream is degrading.

Skip the dashboards-for-show. The two questions you actually need to answer at 2am are: "is this one tenant or everyone?" and "is this one tool or the whole server?" Build the views that answer those.

A minimal production-ready tool, end to end

Pulling the pieces together. This is roughly what a real CRM tool looks like once you stop demoing:

@server.tool("crm.update_deal")
async def update_deal(ctx: Context, deal_id: str, fields: dict) -> dict:
    tenant = ctx.tenant_id
    trace = ctx.trace_id

    with Deadline(total_seconds=20):
        client = ctx.http  # shared connection-pooled client

        async def do_update():
            # Fetch the token on every attempt so a retry after a 401
            # uses the refreshed token rather than the stale one.
            token = await creds.get_access_token(tenant)

            timeout = remaining_budget(reserve=0.3)
            if timeout <= 0:
                raise TimeoutError("budget exhausted")

            resp = await client.patch(
                f"{ctx.crm_base_url}/deals/{deal_id}",
                json=fields,
                headers={
                    "Authorization": f"Bearer {token}",
                    "Idempotency-Key": f"{tenant}:{trace}:update:{deal_id}",
                    "X-Request-Id": trace,
                },
                timeout=timeout,
            )
            if resp.status_code in RETRYABLE_STATUSES:
                raise RetryableError.from_response(resp)
            if resp.status_code == 401:
                # Token may have been revoked mid-flight
                await creds.invalidate(tenant)
                raise RetryableError("auth invalidated", retry_after=0)
            resp.raise_for_status()
            return resp.json()

        async with tenant_semaphore(tenant, limit=8):
            result = await with_retry(do_update, max_attempts=3)

        log_tool_call(
            tool="crm.update_deal", tenant=tenant, trace=trace,
            outcome="ok", upstream_calls=1,
        )
        return {"ok": True, "deal": result}

There's nothing clever here. It's just every concern from the previous sections, in one place: tenant-scoped auth, budgeted timeout, idempotency key, concurrency cap, retry with backoff, structured log line. That's the bar for "production."

How BizFlowAI approaches this

We run MCP servers against client CRMs and ERPs — HubSpot, Salesforce, Pipedrive, NetSuite, custom ERPs — in production. The plumbing in this post (per-tenant credential stores, deadline-propagating clients, idempotency-keyed writes, per-tool dashboards) is roughly the shape of our shared runtime. We don't rebuild it per customer; we configure it per integration.

What we do per customer is the boring, high-leverage part: which of your systems actually deserves an MCP server first. A discovery call maps your stack, your auth posture, and the two or three workflows where an agent-callable surface earns its keep. Most teams don't need ten MCP tools — they need three that are correct, observable, and safe to retry.

Frequently asked questions

How do you handle OAuth token refresh in a production MCP server?

Store tokens in encrypted, KMS-backed storage rather than environment variables, and refresh slightly before expiry with a skew margin of around 60 seconds. Take a per-tenant lock during refresh so parallel tool calls don't all refresh at once, which can lock you out on upstreams that invalidate a refresh token once it has been used. Resolve the tenant at the transport layer and pass it to tools in a typed Context, so no tool ever has to work out which tenant it belongs to.

What is a timeout budget in an MCP server and why use it instead of fixed timeouts?

A timeout budget is a single deadline allocated at the start of a tool call and shared across every downstream operation, rather than a fixed value per HTTP client. Each upstream call derives its timeout from the remaining budget, reserving 200-500ms at the end to return a structured error instead of dying mid-write. This prevents the common failure where sequential upstream calls each succeed under their own timeout but the total exceeds what the agent will wait for.

When is it safe to retry MCP tool calls?

Read-only operations like search or get can be retried freely. Idempotent writes keyed on an external_id, and operations that use an idempotency key the upstream actually honors (Stripe and many modern APIs), are also safe to retry. Non-idempotent writes like create_invoice or send_email should never be retried blindly: use the upstream's idempotency-key header, do a read-before-write check, or queue the operation in a local outbox table with deduplication.

What retry backoff strategy works best for MCP servers calling external APIs?

Use exponential backoff with full jitter (random delay between 0 and the capped exponential value) to de-correlate clients during incidents. Always honor the server's Retry-After header — Salesforce, HubSpot, Stripe, and GitHub all send it, and ignoring it leads to harder throttling. Cap attempts low (3-4) and never sleep past the remaining timeout budget.

Should long-running operations be implemented as synchronous MCP tools?

No. If a tool needs more than around 30 seconds, convert it to an async job pattern: return a job ID immediately and expose a separate get_job_status tool for the agent to poll. Long-running synchronous tools wreck agent loops, burn token budgets, and often finish only after the agent has stopped waiting.

Work with BizFlowAI

If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.

Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.
