Multi-Agent Systems Are Overrated: When One Agent Wins

You read a paper about agent swarms, watched a demo where five agents debate the answer, and now your roadmap has a "Researcher → Planner → Coder → Critic → Reviewer" diagram on it. Meanwhile the actual task — summarize a sales call, draft a follow-up, log it to HubSpot — fits in one prompt and three tool calls. Before you wire up an org chart of LLMs, it's worth asking whether the problem you're solving actually needs one.
The pattern most teams are copying without thinking
The popular "multi-agent" architecture looks like this: a planner agent decomposes a task, specialist agents execute sub-tasks in parallel or sequence, a critic agent reviews output, and an orchestrator stitches it all together. It looks impressive in a diagram. It usually behaves worse than a single model with the right tools.
What you actually get when you wire this up for a normal business task:
- Latency stacks. Five sequential LLM calls instead of one. Even with fast models, you've turned a 4-second task into 20+ seconds.
- Token bills compound. Each agent re-reads context, re-explains itself, and produces intermediate artifacts that nobody outside the system will ever see.
- Error compounds too. If each agent is 95% reliable, a five-step pipeline is roughly 77% reliable end-to-end. Now you need retry logic, fallback paths, and a way to debug failures across hops.
- Debugging is miserable. When the output is wrong, which agent broke? You're now reading five traces instead of one.
None of this means multi-agent is useless. It means the default should be one agent, and you should have to earn the second one with a concrete reason.
What a "single well-tooled agent" actually looks like
Most of the work people throw at multi-agent systems is really one prompt and a handful of tools. The model decides what to call, the tools do the deterministic work, and the model writes the final answer.
A worked example. Say you want to triage an incoming support email: classify it, look up the customer, draft a reply, and create a ticket if needed.
from anthropic import Anthropic
client = Anthropic()
tools = [
{
"name": "lookup_customer",
"description": "Look up a customer by email. Returns plan, MRR, and open tickets.",
"input_schema": {
"type": "object",
"properties": {"email": {"type": "string"}},
"required": ["email"],
},
},
{
"name": "create_ticket",
"description": "Create a support ticket. Use only for issues that need human follow-up.",
"input_schema": {
"type": "object",
"properties": {
"email": {"type": "string"},
"subject": {"type": "string"},
"priority": {"type": "string", "enum": ["low", "normal", "high"]},
"summary": {"type": "string"},
},
"required": ["email", "subject", "priority", "summary"],
},
},
{
"name": "draft_reply",
"description": "Save a draft reply to the customer (does not send).",
"input_schema": {
"type": "object",
"properties": {"email": {"type": "string"}, "body": {"type": "string"}},
"required": ["email", "body"],
},
},
]
SYSTEM = """You are a support triage assistant.
For each incoming email:
1. Look up the customer.
2. If their issue needs human follow-up, create a ticket with appropriate priority.
3. Always draft a reply.
Be concise. Don't invent facts about the customer."""
def handle_email(raw_email: str):
messages = [{"role": "user", "content": raw_email}]
while True:
resp = client.messages.create(
model="claude-sonnet-4-5",
system=SYSTEM,
tools=tools,
messages=messages,
max_tokens=1024,
)
if resp.stop_reason == "end_turn":
return resp
messages.append({"role": "assistant", "content": resp.content})
tool_results = [run_tool(b) for b in resp.content if b.type == "tool_use"]
messages.append({"role": "user", "content": tool_results})
That's the whole thing. One model, three tools, a loop. No planner. No critic. No orchestrator. If you wrote this as a multi-agent system, you'd have a "Classifier Agent," a "Lookup Agent," a "Ticket Agent," and a "Drafting Agent" — and they'd all be doing what one model can do in a single conversation.
Where multi-agent actually earns its complexity
There are real cases where splitting work across agents is the right call. The honest list is shorter than the hype suggests.
1. Genuinely parallel work with isolated context. If you need to research ten companies in parallel and synthesize the results, ten sub-agents each with their own context window is faster and cleaner than one agent that has to juggle all ten in working memory. The key word is isolated: the sub-tasks don't need to talk to each other.
2. Context window pressure. Some workflows — large codebase refactors, long-document analysis — genuinely overflow a single context window. Spawning sub-agents to handle scoped chunks, each returning a compact summary, is a real architectural answer, not a vibes-based one.
3. Different models for different jobs. A cheap fast model for classification, a frontier model for the hard reasoning step, a specialized model for code. This is technically "multi-agent" but it's really just routing.
4. Adversarial review for high-stakes output. Legal drafts, financial summaries, anything that goes to a regulator. A separate critic pass with a different prompt (or different model) catches things the writer missed. Note: this is usually one extra call, not a five-agent committee.
5. Long-running workflows with checkpoints. Multi-hour or multi-day processes where state has to persist, humans approve between stages, and steps run on different schedules. Here "multi-agent" really means "multi-step workflow with LLM steps in it" — which is a different beast from agent-to-agent chatter.
If your use case isn't on that list, you're almost certainly building an org chart for a task that wants a single prompt.
A decision table you can actually use
| Signal | Single agent | Multi-agent |
|---|---|---|
| Task fits in one context window | ✅ | — |
| Steps are inherently sequential | ✅ | — |
| Sub-tasks share most of the same context | ✅ | — |
| Sub-tasks are independent and parallelizable | — | ✅ |
| Need different models for cost or capability | Routing is fine | ✅ |
| Output is high-stakes (legal, financial, medical) | + critic pass | ✅ |
| Workflow spans hours/days with human approval | — | ✅ |
| You're doing it because the architecture diagram looks cool | ✅ (stop) | ❌ |
If you're undecided, default to single agent. You can always split later when you have a concrete failure mode to point at. You cannot easily un-split a tangled crew of agents that have grown around each other.
How to harden a single agent instead of multiplying it
When a single agent isn't reliable enough, the instinct is to add another agent to check its work. Usually the real fix is one of these:
Tighten the tool surface. Most agent failures are tool failures: ambiguous schemas, tools that overlap, tools that do too much. Each tool should do one thing, have an unambiguous description, and return structured data the model can actually reason about. If your update_customer tool takes a free-form changes string, you're going to have a bad time.
Constrain output with schemas. If you need structured output, use structured output. Don't ask a critic agent to verify that the writer agent produced valid JSON — make invalid JSON impossible.
{
"name": "log_call_outcome",
"input_schema": {
"type": "object",
"properties": {
"outcome": {"type": "string", "enum": ["qualified", "not_qualified", "follow_up"]},
"next_action_date": {"type": "string", "format": "date"},
"notes": {"type": "string", "maxLength": 500}
},
"required": ["outcome", "notes"]
}
}
Add deterministic guardrails around the agent, not inside it. Validate tool inputs before executing. Rate-limit destructive actions. Require human approval for anything irreversible. These are boring software engineering moves and they outperform "ask another LLM to double-check."
Use one critic pass when you genuinely need it. Not a critic agent that lives in a loop with the writer. Just: run the agent, take its output, run one verification prompt, decide whether to ship or retry. That's two LLM calls, not five.
Log everything and look at the logs. Most teams I talk to have never read the actual tool-use traces from their agent. They're tuning prompts based on vibes. Read fifty traces end-to-end and you'll find the real failure mode in an afternoon.
The honest cost comparison
Let's compare a realistic scenario: triage 1,000 inbound leads per week, with classification, enrichment, scoring, and a drafted outreach email.
Single agent approach:
- 1 LLM call per lead (with tool use loop, call it 2-3 internal model invocations)
- Roughly one round-trip of latency the user perceives
- One prompt to maintain
- One trace to debug
Five-agent crew approach (planner → classifier → enricher → scorer → writer → reviewer):
- 5-6 LLM calls per lead, each re-reading shared context
- ~5x the latency
- Six prompts to keep in sync as your business rules change
- Six places where a failure can happen, with handoffs to debug between them
For a high volume workflow, the cost delta is not marginal. And the multi-agent version is almost never more accurate on routine business tasks — it's frequently less accurate because errors cascade and each agent has only a slice of the picture.
The multi-agent version starts winning when individual sub-tasks need wildly different context, models, or run on different schedules. For most ops automation? It's a tax you pay for a diagram.
How BizFlowAI approaches this
We've inherited a fair number of multi-agent systems from previous vendors or in-house experiments. The most common rewrite we do is collapsing four-or-five-agent crews into one well-tooled agent with a tight schema and a clear loop. The wins are usually some combination of lower latency, lower token spend, and — the one people don't expect — higher reliability, because we've stopped accumulating error across handoffs.
That doesn't mean we never run multi-agent. We do, when the workload is genuinely parallel or when one step needs a different model. But we treat that as an architectural decision with a cost, not a default. If you're staring at an agent diagram that's grown organically and you're not sure whether it should be one agent or six, a discovery call is a cheap second opinion before you commit another quarter to it.
A short checklist before you build a multi-agent system
Before you draw the planner/worker/critic diagram, answer these out loud:
- Can I describe the task in one prompt? If yes, start there.
- Do the sub-tasks need genuinely different context, or am I splitting for neatness?
- What concrete failure mode am I solving by adding another agent? "Better quality" is not a failure mode.
- What's my latency budget? Does the multi-agent version still fit it?
- How will I debug this in production at 2am?
- If I delete the planner agent and just give the worker model the tools directly, does anything actually break?
If you can't answer (3) with a specific bug you've seen, you don't need another agent. You need to read your traces.
The single-agent-with-tools pattern is unfashionable right now because it doesn't make for a good conference talk. It also happens to be what ships, what stays up, and what your finance team won't have to argue with you about next quarter. Start there. Earn the second agent.
Frequently asked questions
When should I use a multi-agent system instead of a single agent?
Use multi-agent only when sub-tasks are genuinely independent and parallelizable, when the work overflows a single context window, when different models are needed for cost or capability, when high-stakes output requires an adversarial critic pass, or when the workflow spans hours or days with human approval steps. For most business tasks like email triage, summarization, or CRM logging, a single agent with well-designed tools is faster, cheaper, and more reliable. Default to one agent and earn the second one with a concrete reason.
Why are multi-agent systems often worse than a single LLM agent?
Multi-agent pipelines stack latency (five sequential calls instead of one), compound token costs because each agent re-reads context, and multiply error rates — a five-step pipeline of 95% reliable agents is only about 77% reliable end-to-end. Debugging is also painful because you have to trace failures across multiple agent hops. A single well-tooled agent avoids all of this for the majority of tasks.
How do I make a single AI agent more reliable without adding more agents?
Tighten the tool surface so each tool does one thing with an unambiguous schema, constrain output using JSON schemas instead of asking another LLM to validate it, and add deterministic guardrails like input validation, rate limits, and human approval for irreversible actions. If verification is truly needed, use one critic pass — two total LLM calls, not a loop. Finally, read fifty real tool-use traces to find the actual failure mode instead of tuning prompts by intuition.
What does a single well-tooled Claude agent look like in code?
It's one model call in a loop with a small set of tools. You define tools with clear JSON input schemas (e.g. lookup_customer, create_ticket, draft_reply), send the user message to Claude with the tools parameter, and loop: if the response is end_turn, return it; otherwise execute the requested tool calls and append the results as a user message. No planner, critic, or orchestrator agents are needed.
Is using different models for different steps considered multi-agent?
Technically yes, but it's really just routing — for example using a cheap fast model for classification and a frontier model for hard reasoning. This is a legitimate pattern and much simpler than a full multi-agent system with planners and critics. Treat it as model selection inside one workflow, not as agent-to-agent communication.
Work with BizFlowAI
If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.
Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.
More guides like this on the BizFlowAI blog.
Frequently asked questions
When should I use a multi-agent system instead of a single agent?
Use multi-agent only when sub-tasks are genuinely independent and parallelizable, when the work overflows a single context window, when different models are needed for cost or capability, when high-stakes output requires an adversarial critic pass, or when the workflow spans hours or days with human approval steps. For most business tasks like email triage, summarization, or CRM logging, a single agent with well-designed tools is faster, cheaper, and more reliable. Default to one agent and earn the second one with a concrete reason.
Why are multi-agent systems often worse than a single LLM agent?
Multi-agent pipelines stack latency (five sequential calls instead of one), compound token costs because each agent re-reads context, and multiply error rates — a five-step pipeline of 95% reliable agents is only about 77% reliable end-to-end. Debugging is also painful because you have to trace failures across multiple agent hops. A single well-tooled agent avoids all of this for the majority of tasks.
How do I make a single AI agent more reliable without adding more agents?
Tighten the tool surface so each tool does one thing with an unambiguous schema, constrain output using JSON schemas instead of asking another LLM to validate it, and add deterministic guardrails like input validation, rate limits, and human approval for irreversible actions. If verification is truly needed, use one critic pass — two total LLM calls, not a loop. Finally, read fifty real tool-use traces to find the actual failure mode instead of tuning prompts by intuition.
What does a single well-tooled Claude agent look like in code?
It's one model call in a loop with a small set of tools. You define tools with clear JSON input schemas (e.g. lookup_customer, create_ticket, draft_reply), send the user message to Claude with the tools parameter, and loop: if the response is end_turn, return it; otherwise execute the requested tool calls and append the results as a user message. No planner, critic, or orchestrator agents are needed.
Is using different models for different steps considered multi-agent?
Technically yes, but it's really just routing — for example using a cheap fast model for classification and a frontier model for hard reasoning. This is a legitimate pattern and much simpler than a full multi-agent system with planners and critics. Treat it as model selection inside one workflow, not as agent-to-agent communication.