Alibaba's SkillWeaver Cuts Agent Tokens by Up to 99%

By Lazar Milicevic · Published July 3, 2026 · 10 min read


Your agent has 40 tools loaded into context. It picks the wrong one on step 3, hallucinates a JSON argument, and burns 80K tokens before hitting the retry ceiling. You've been debugging the same class of failure for a week: too many options, too much context, too little signal for the model to route correctly.

This is the exact problem Alibaba's research team went after with SkillWeaver. The reported result — up to 99% reduction in token consumption on complex agent workflows — sounds like marketing, but the mechanism is unglamorous and correct: stop dumping every tool schema into the prompt on every step. Build a graph, load only what the current node needs. If you run agents in production, this pattern is worth stealing whether you use their framework or not.

What SkillWeaver actually does

SkillWeaver is a routing framework that treats a workflow as an execution graph instead of a flat tool list. Given a task, it plans the graph up front — nodes are subtasks, edges are dependencies — and each node is bound to a small set of relevant skills or tools. At runtime, the agent only sees the skills scoped to the current node. It never carries the full inventory in context.

The problem it solves is well-known to anyone running a serious agent stack. When you have 100+ tools registered — a common state for anything touching CRM, billing, calendar, email, docs, internal APIs — three things break at once:

Prompt cost balloons. Every tool definition (name, description, parameter schema) sits in the system prompt on every LLM call. For dense JSON schemas, each tool can run 200–800 tokens. 100 tools = 20K–80K tokens before the user even speaks.
Selection accuracy drops. LLMs get worse at picking the right tool as the candidate set grows. A rule of thumb often cited from function-calling benchmarks is that accuracy degrades noticeably past ~15–20 tools in the same call.
Retries compound. A wrong tool pick means a failed step, a retry, another full-context call, more tokens. Because each retry resends the full tool inventory along with a longer history, the bill grows faster than linearly with workflow length.

SkillWeaver's answer: plan the graph, prune the context per node, execute. The 99% claim comes from workflows where the baseline dumped every skill into every call — a strawman baseline in one sense, but also exactly what a naive agent loop with bind_tools() does today.

Why "load every tool" is the default (and why it's wrong)

If you've built an agent with LangChain, LlamaIndex, or a hand-rolled loop around OpenAI's function calling, you've probably done this:

tools = [
    search_crm, create_deal, update_deal, get_contact,
    send_email, schedule_meeting, create_invoice,
    # ... 60 more
]

agent = create_agent(model, tools=tools, prompt=system_prompt)
result = agent.invoke({"input": user_request})

Under the hood, every tool's schema is serialized into the model's context on every turn of the reasoning loop. For a 10-step workflow with 60 tools, you're paying for 60 tool schemas × 10 turns = 600 tool descriptions in billed tokens. Anthropic and OpenAI both bill these as input tokens on every request. Prompt caching lowers the rate for the repeated prefix but does not make it free, and even cached, the model still has to attend over all of it.
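To put a number on your own tool list, a rough estimate is enough. The sketch below serializes each definition and divides the character count by four. That ratio is an assumption (a common rule of thumb, not a real tokenizer), and `estimate_schema_tokens`, `schema_tokens_per_workflow`, and the sample tool are hypothetical names used only for illustration.

```python
import json

CHARS_PER_TOKEN = 4  # assumption: rough heuristic, not a real tokenizer


def estimate_schema_tokens(tool: dict) -> int:
    """Approximate the tokens one tool definition adds to a request."""
    serialized = json.dumps(tool, separators=(",", ":"))
    return max(1, len(serialized) // CHARS_PER_TOKEN)


def schema_tokens_per_workflow(tools: list[dict], turns: int) -> int:
    """Every turn resends every bound tool, so cost scales with tools x turns."""
    per_turn = sum(estimate_schema_tokens(tool) for tool in tools)
    return per_turn * turns


# Hypothetical tool definition, used only to exercise the functions.
sample_tool = {
    "name": "search_crm",
    "description": "Find a contact by email or phone",
    "input_schema": {
        "type": "object",
        "properties": {"email": {"type": "string"}, "phone": {"type": "string"}},
    },
}

bound = [sample_tool] * 60  # 60 tools bound globally
print(schema_tokens_per_workflow(bound, turns=10))
```

For billing-grade numbers, swap the heuristic for your provider's token counting; the shape of the calculation stays the same.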

The default is wrong because most steps in most workflows only need 1–3 tools. A "reply to inbound lead" step needs the CRM lookup and the email send. It doesn't need the invoicing tool, the meeting scheduler, or the 40 internal endpoints. Loading them anyway isn't just wasteful — it actively hurts routing accuracy.

The three ideas worth stealing

You don't need to adopt SkillWeaver to benefit from what it demonstrates. The paper crystallizes three patterns that any serious agent builder should be running.

1. Plan before you execute

Instead of a ReAct-style loop where the model plans and acts in the same turn, separate planning from execution. A cheap planner call (often Haiku or a mini model) produces a DAG of subtasks. Execution then walks the DAG with a fresh, tightly scoped context per node.

Rough shape:

plan = planner.invoke({
    "task": user_request,
    # One-line summaries keyed by skill id, never full schemas
    "available_skills": skill_registry.summaries()
})
# plan = [{"node": "lookup_lead", "skills": ["crm.search"]},
#         {"node": "score_lead", "skills": ["scoring.evaluate"]},
#         {"node": "draft_reply", "skills": ["email.draft", "crm.get_context"]}]

for node in plan:
    # Load full schemas only for the skills this node was assigned
    scoped_tools = skill_registry.load(node["skills"])
    executor.invoke(node, tools=scoped_tools)

The planner sees skill ids and one-line summaries, not full schemas, which is what lets it name specific skills per node. The executor sees full schemas but only for its node.
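The rough shape above walks a flat list. If the plan carries real dependencies, the walk can be sketched with the standard library's `graphlib.TopologicalSorter`. The `depends_on` field and the `run_node` callback are assumptions made for this example, not part of SkillWeaver's published interface.

```python
from graphlib import TopologicalSorter

# Hypothetical plan: "depends_on" is an assumed field listing a node's prerequisites.
plan = {
    "lookup_lead": {"skills": ["crm.search"], "depends_on": []},
    "score_lead": {"skills": ["scoring.evaluate"], "depends_on": ["lookup_lead"]},
    "draft_reply": {
        "skills": ["email.draft", "crm.get_context"],
        "depends_on": ["lookup_lead", "score_lead"],
    },
}


def execution_order(plan: dict) -> list[str]:
    """Return node names so that every node comes after its dependencies."""
    graph = {name: set(node["depends_on"]) for name, node in plan.items()}
    return list(TopologicalSorter(graph).static_order())


def run_plan(plan: dict, run_node) -> dict:
    """Walk the DAG, handing each node only its own skills and upstream results."""
    results: dict = {}
    for name in execution_order(plan):
        node = plan[name]
        upstream = {dep: results[dep] for dep in node["depends_on"]}
        results[name] = run_node(name, node["skills"], upstream)
    return results


# Stand-in executor: records which skills each node was allowed to see.
seen = run_plan(plan, lambda name, skills, upstream: {"skills": skills})
print(execution_order(plan))
```

`TopologicalSorter` raises `CycleError` on a cyclic plan, which is a cheap sanity check on planner output before any executor tokens are spent.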

2. Skill registry with lazy loading

Treat tools as a versioned catalog, not a bound array. Each skill has a lightweight descriptor (name, one-line purpose, category, cost tier) and a full schema loaded on demand.

{
  "id": "crm.search_contact",
  "category": "crm",
  "summary": "Find a contact by email or phone",
  "cost_tier": "cheap",
  "schema_ref": "schemas/crm/search_contact.v3.json"
}

The planner reads summaries. The executor pulls schemas only for the skills it will actually call. MCP (Model Context Protocol) supports a similar split on the client side: a client discovers a server's tools at runtime through the protocol's tool listing and then decides which of those definitions to put in front of the model, rather than hard-coding every schema into the prompt.
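A minimal sketch of such a registry, assuming descriptors shaped like the JSON above and schemas stored as JSON files on disk. `SkillRegistry`, `SkillDescriptor`, `summaries()`, and the temporary file layout are hypothetical; only `load()` echoes the name used in the earlier rough shape.

```python
import json
import tempfile
from dataclasses import dataclass
from functools import lru_cache
from pathlib import Path


@dataclass(frozen=True)
class SkillDescriptor:
    id: str
    category: str
    summary: str
    cost_tier: str
    schema_ref: str  # path to the full schema, relative to the registry root


class SkillRegistry:
    def __init__(self, root: Path, descriptors: list[SkillDescriptor]):
        self._root = root
        self._by_id = {d.id: d for d in descriptors}
        # Cache per instance so each schema file is read at most once.
        self._load_schema = lru_cache(maxsize=None)(self._read_schema)

    def summaries(self) -> dict[str, str]:
        """Cheap view for the planner: ids and one-line purposes only."""
        return {d.id: d.summary for d in self._by_id.values()}

    def _read_schema(self, skill_id: str) -> dict:
        path = self._root / self._by_id[skill_id].schema_ref
        return json.loads(path.read_text(encoding="utf-8"))

    def load(self, skill_ids: list[str]) -> list[dict]:
        """Full schemas, only for the skills a node will actually call."""
        return [self._load_schema(skill_id) for skill_id in skill_ids]


# Usage with a throwaway schema file standing in for a real catalog.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "search_contact.v3.json").write_text(
        json.dumps({"name": "crm_search_contact"}), encoding="utf-8"
    )
    registry = SkillRegistry(root, [
        SkillDescriptor("crm.search_contact", "crm",
                        "Find a contact by email or phone", "cheap",
                        "search_contact.v3.json"),
    ])
    planner_view = registry.summaries()
    loaded = registry.load(["crm.search_contact"])
```

The planner only ever receives `planner_view`; nothing is read from disk until a node asks for a specific skill.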

3. Cache the graph, not just the prompt

Because the same workflow types recur (every new inbound lead follows the same 4-node graph), the graph itself becomes cacheable. Skip the planner call entirely on recognized task shapes. This is where the 99% number gets its biggest lift — for repetitive workflows, you're not just cutting per-call token cost, you're cutting the planner call too.
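One way to sketch graph caching is to key plans on a fingerprint of the task shape rather than the task text. What counts as a "shape" is an assumption here (workflow type plus the set of input fields present), and `fingerprint`, `PlanCache`, and `fake_planner` are hypothetical names.

```python
import hashlib
import json
from typing import Callable


def fingerprint(workflow_type: str, input_fields: list[str]) -> str:
    """Stable key for a task shape: workflow type plus which fields are present."""
    canonical = json.dumps(
        {"type": workflow_type, "fields": sorted(input_fields)}, sort_keys=True
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PlanCache:
    def __init__(self, planner: Callable[[str], list[dict]]):
        self._planner = planner
        self._plans: dict[str, list[dict]] = {}
        self.planner_calls = 0

    def get_plan(self, task: str, workflow_type: str, input_fields: list[str]) -> list[dict]:
        key = fingerprint(workflow_type, input_fields)
        if key not in self._plans:
            # Unknown shape: pay for one planner call, then reuse the graph.
            self.planner_calls += 1
            self._plans[key] = self._planner(task)
        return self._plans[key]


def fake_planner(task: str) -> list[dict]:
    # Stand-in for the real planner call.
    return [{"node": "lookup_lead", "skills": ["crm.search"]}]


cache = PlanCache(fake_planner)
first = cache.get_plan("New lead: ana@example.com", "inbound_lead", ["email", "name"])
second = cache.get_plan("New lead: bo@example.com", "inbound_lead", ["name", "email"])
print(cache.planner_calls)
```

Two different leads with the same shape share one plan, so the second request skips the planner entirely. A production version would also need invalidation when the skill catalog changes.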

Token math: what this actually saves

Let's ground the 99% claim in numbers you can reproduce.

Scenario	Tools loaded per call	Workflow steps	Tokens spent on tool schemas
Naive: bind everything	80 tools × ~400 tok	8 steps	~256,000 tokens
Category-based routing	15 tools × ~400 tok	8 steps	~48,000 tokens
Graph-scoped (SkillWeaver-style)	2–3 tools × ~400 tok	8 steps	~8,000 tokens

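That's a 32x reduction from naive to graph-scoped (256,000 / 8,000), or about 97% fewer schema tokens, before accounting for the retry savings from better tool selection. Wrong-tool retries widen the gap, because every retry in the naive loop is another call carrying all 80 schemas. A 5% retry rate alone only adds about 5% to the naive total, so multipliers well beyond 32x have to come from heavier retry loops or from cached graphs that skip planning. The 99% headline assumes a maximally bloated baseline and a heavily repetitive workflow, which is plausible for enterprise ops automation and less so for one-off exploration.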
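The table's arithmetic can be reproduced in a few lines. The only assumption beyond the table itself is using 2.5 as the midpoint of "2–3 tools" for the graph-scoped row; `schema_tokens` is a hypothetical helper name.

```python
def schema_tokens(tools_per_call: float, tokens_per_tool: int, steps: int) -> int:
    """Tokens spent on tool schemas across one workflow."""
    return round(tools_per_call * tokens_per_tool * steps)


naive = schema_tokens(80, 400, 8)
category = schema_tokens(15, 400, 8)
graph_scoped = schema_tokens(2.5, 400, 8)  # midpoint of 2-3 tools per node

multiplier = naive / graph_scoped
reduction_pct = 100 * (1 - graph_scoped / naive)

print(naive, category, graph_scoped)
print(f"{multiplier:.0f}x fewer schema tokens, {reduction_pct:.1f}% reduction")
```

Plug in your own tool count, average schema size, and step count to see where your workflow sits between the three rows.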

The point is not the exact multiplier. The point is that the default "bind every tool" pattern is leaving 90%+ of your tool-schema token spend on the floor as pure waste.

Where this pattern matches MCP and existing tooling

Anthropic's Model Context Protocol fits this problem well. An MCP server exposes its tools through a discovery interface: the client lists a server's tools and their schemas at runtime and chooses which servers are active per session. Claude Code works this way: you don't load every possible MCP server into every session, you load the ones relevant to the current work.

If you're building on MCP today, SkillWeaver-style routing is already partially free. What's missing in most implementations is the graph planning layer on top. Most agents still ask the model to pick tools reactively rather than planning the full sequence up front. Adding a lightweight planner call before execution is usually a one-day change and pays back on the first production workflow.
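On the client side, per-node scoping over an MCP tool listing can be sketched as a plain filter. The input below mimics the `name`, `description`, and `inputSchema` fields of an MCP tool entry, and the output uses the `name`, `description`, and `input_schema` fields of an Anthropic Messages API tool. The sample tools and the function names are hypothetical, and no MCP client library is used here.

```python
def to_anthropic_tool(mcp_tool: dict) -> dict:
    """Convert one MCP tool entry into the Messages API tool shape."""
    return {
        "name": mcp_tool["name"],
        "description": mcp_tool.get("description", ""),
        "input_schema": mcp_tool["inputSchema"],
    }


def scope_tools(listed: list[dict], allowed: set[str]) -> list[dict]:
    """Keep only the tools the current node was assigned."""
    available = {tool["name"] for tool in listed}
    missing = allowed - available
    if missing:
        raise KeyError(f"plan references tools the server does not expose: {sorted(missing)}")
    return [to_anthropic_tool(tool) for tool in listed if tool["name"] in allowed]


# Hypothetical tool listing, trimmed to the fields used above.
listed_tools = [
    {"name": "search_contact", "description": "Find a contact", "inputSchema": {"type": "object"}},
    {"name": "create_invoice", "description": "Create an invoice", "inputSchema": {"type": "object"}},
    {"name": "book_meeting", "description": "Book a meeting slot", "inputSchema": {"type": "object"}},
]

node_tools = scope_tools(listed_tools, {"search_contact"})
print([tool["name"] for tool in node_tools])
```

Failing loudly on a missing tool catches planner output that names skills the server never offered.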

The same pattern shows up in:

LangGraph: subgraphs let you scope tools per node explicitly, though you have to design the graph yourself.
CrewAI: role-based agents naturally scope tools per crew member, which is a coarse version of the same idea.
OpenAI's Assistants API: supports per-run tool overrides, though most integrations ignore this and bind globally.

None of these give you SkillWeaver's automatic graph construction out of the box. But all of them can implement the pattern manually if you commit to it.

When this pattern hurts you

Every architectural choice has a failure mode. Graph-based routing has three worth calling out.

Exploratory / open-ended tasks. If the workflow genuinely can't be planned up front — the agent has to discover what to do as it goes — a graph planner will produce a bad plan and cost you an extra call. ReAct-style loops with a smaller tool set still win here. Use graph routing for workflows you've seen before; use reactive loops for genuinely novel work.

Small tool inventories. If you have 8 tools total, the schema overhead is 3–5K tokens. You're paying complexity for a rounding-error saving. The math only bites past ~20 tools.

Rapidly changing tools. If you're iterating on tool definitions daily, maintaining a skill registry with descriptors, schemas, and categories becomes drag. Stabilize the tool catalog before you invest in the routing layer.

The honest failure mode I've hit: over-scoped nodes. If your planner assigns 15 tools to a node "just in case," you're back to the naive baseline with extra steps. Discipline in the planner prompt matters more than the graph itself.
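Prompt discipline is easier to hold when the plan is also checked in code. A minimal guard, assuming plans shaped like the earlier examples (a list of nodes with a `skills` list); the cap of 3 and the name `validate_plan` are assumptions for this sketch.

```python
MAX_SKILLS_PER_NODE = 3  # assumption: mirrors the 1-3 tools most steps need


def validate_plan(plan: list[dict], known_skills: set[str],
                  max_skills: int = MAX_SKILLS_PER_NODE) -> list[str]:
    """Return a list of problems; an empty list means the plan is acceptably scoped."""
    problems = []
    for index, node in enumerate(plan):
        label = node.get("node", f"node[{index}]")
        skills = node.get("skills", [])
        if not skills:
            problems.append(f"{label}: no skills assigned")
        if len(skills) > max_skills:
            problems.append(f"{label}: {len(skills)} skills exceeds cap of {max_skills}")
        for skill in skills:
            if skill not in known_skills:
                problems.append(f"{label}: unknown skill {skill!r}")
    return problems


known = {"crm.search", "email.draft", "email.send", "calendar.book"}
plan = [
    {"node": "lookup_lead", "skills": ["crm.search"]},
    {"node": "draft_reply", "skills": ["email.draft", "email.send", "calendar.book", "crm.search"]},
    {"node": "bill", "skills": ["invoice.create"]},
]
problems = validate_plan(plan, known)
print(problems)
```

A plan that fails the check can be sent back to the planner with the problem list attached, which is usually cheaper than executing an over-scoped node.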

A minimal implementation you can ship this week

Here's the skeleton I'd start with for a real project. No framework required — just a planner call, a registry, and an executor loop.

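import json

from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Keys are skill ids. Each "schema" is a full tool definition
# (name, description, input_schema); the bodies are elided here.
SKILL_REGISTRY = {
    "crm.search": {"summary": "Find contact by email", "schema": {...}},
    "crm.update": {"summary": "Update contact fields", "schema": {...}},
    "email.draft": {"summary": "Draft an email reply", "schema": {...}},
    "email.send": {"summary": "Send a drafted email", "schema": {...}},
    "calendar.book": {"summary": "Book a meeting slot", "schema": {...}},
    # ... 50 more
}

def parse_json(text: str) -> list[dict]:
    # Keep only the outermost JSON array in case the model adds prose around it.
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        raise ValueError("planner did not return a JSON list")
    return json.loads(text[start:end + 1])

def plan_graph(task: str) -> list[dict]:
    # The planner sees one-line summaries only, never full schemas.
    summaries = {k: v["summary"] for k, v in SKILL_REGISTRY.items()}
    resp = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1024,
        system=f"Available skills: {json.dumps(summaries)}. Output only a JSON list of "
               "nodes, each with 'goal' and 'skills' (max 3 skills per node).",
        messages=[{"role": "user", "content": task}],
    )
    return parse_json(resp.content[0].text)

def handle_tool_calls(resp, context: dict) -> dict:
    # Placeholder: records the tool calls the model requested. A real version
    # would run each tool and send the result back to the model.
    calls = [{"name": b.name, "input": b.input} for b in resp.content if b.type == "tool_use"]
    return {**context, "tool_calls": context.get("tool_calls", []) + calls}

def execute_node(node: dict, context: dict) -> dict:
    # Only the schemas for this node's skills are sent to the model.
    tools = [SKILL_REGISTRY[s]["schema"] for s in node["skills"]]
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        tools=tools,
        messages=[{"role": "user", "content": f"{node['goal']}\n\nContext: {json.dumps(context)}"}],
    )
    return handle_tool_calls(resp, context)

def run(task: str) -> dict:
    graph = plan_graph(task)
    context = {"task": task}
    for node in graph:
        # Each node's output becomes the next node's context.
        context = execute_node(node, context)
    return context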

That's the whole idea. A cheap planner picks the graph. A stronger model executes each node with only the tools it needs. Context flows forward through the nodes. You can add graph caching, parallel node execution, and retry logic once the basic loop is solid — but the token savings kick in immediately.
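Retry logic is the cheapest of those additions to sketch. The idea is to retry the failed node with its own scoped tools rather than restart the workflow. `NodeFailed`, `run_with_retries`, and the flaky executor below are hypothetical stand-ins; a real executor would decide for itself what counts as a failed node.

```python
class NodeFailed(Exception):
    """Hypothetical error an executor raises when a node's tool call fails."""


def run_with_retries(execute, node: dict, context: dict, max_attempts: int = 2) -> dict:
    """Retry one node with its own scoped tools instead of restarting the workflow."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute(node, context)
        except NodeFailed as exc:
            last_error = exc  # context is unchanged, so the retry stays small
    raise RuntimeError(f"node {node.get('goal')!r} failed {max_attempts} times") from last_error


# Stand-in executor that fails once, then succeeds.
attempts = {"count": 0}


def flaky_execute(node: dict, context: dict) -> dict:
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise NodeFailed("wrong tool picked")
    return {**context, "done": node["goal"]}


result = run_with_retries(
    flaky_execute,
    {"goal": "draft reply", "skills": ["email.draft"]},
    {"task": "reply to lead"},
)
print(attempts["count"], result)
```

Because the retry carries only the node's two or three schemas, a wrong pick costs a fraction of what it does in a bind-everything loop.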

How BizFlowAI approaches this

We've been running scoped-tool routing on client agents for a while now — it's not because we saw Alibaba's paper, it's because the naive "bind everything" pattern breaks the moment a client's tool inventory crosses about 20 skills. Our standard stack builds an MCP-first skill registry per client, a lightweight planner call (usually Haiku-tier) that produces the execution graph, and per-node tool scoping so the executor model only sees what's relevant. The graph gets cached for recurring workflow shapes — inbound lead intake, invoice follow-up, support triage — which is where most of the real cost reduction comes from.

If you're running an agent in production and your token bill feels wrong, it probably is. Most of the audits we do surface the same pattern: 30–60 tools bound globally, no planning layer, retry loops eating half the spend. A short discovery call is usually enough to identify whether graph routing would move the number for your workflow, or whether the win is elsewhere (prompt caching, model tier, or just cutting three tools nobody uses).

What to do this week

If you take one thing from the SkillWeaver work, make it this: count the tools bound to your production agent, then count how many any single workflow step actually uses. The ratio is almost always 10:1 or worse. That ratio is your token savings ceiling.

Concrete steps in order of ROI:

Audit tool usage. Log which tools your agent actually calls per workflow. Cut anything unused for 30 days.
Group by category. Even a crude split — CRM tools, comms tools, billing tools — lets you scope by category without a full graph.
Add a planner call. Use a cheap model to produce a plan before executing. One extra call, big context savings downstream.
Cache repeated graphs. Fingerprint the task; skip the planner for known shapes.
Measure. Track tokens per workflow completion, not tokens per call. That's the number that shows up on the bill.
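The audit and measurement steps can start from a flat call log. The sketch below assumes each log entry records `workflow_id`, `step`, `tool`, and `tokens`, and that the log covers the window you care about (for example the last 30 days). The log shape and `audit_tool_usage` are hypothetical.

```python
from collections import Counter, defaultdict


def audit_tool_usage(bound_tools: list[str], call_log: list[dict]) -> dict:
    """Summarize unused tools, the bound-to-used ratio, and tokens per workflow."""
    calls = Counter(entry["tool"] for entry in call_log)
    tools_per_step: dict[tuple, set] = defaultdict(set)
    tokens_per_workflow: Counter = Counter()
    for entry in call_log:
        tools_per_step[(entry["workflow_id"], entry["step"])].add(entry["tool"])
        tokens_per_workflow[entry["workflow_id"]] += entry["tokens"]

    # Widest step = the most distinct tools any single step actually used.
    widest_step = max((len(tools) for tools in tools_per_step.values()), default=0)
    workflows = len(tokens_per_workflow)
    return {
        "unused": sorted(t for t in bound_tools if calls[t] == 0),
        "bound_to_used_ratio": len(bound_tools) / widest_step if widest_step else None,
        "avg_tokens_per_workflow": sum(tokens_per_workflow.values()) / workflows if workflows else 0.0,
    }


bound = ["crm.search", "email.send", "invoice.create", "calendar.book"]
log = [
    {"workflow_id": "w1", "step": 1, "tool": "crm.search", "tokens": 1200},
    {"workflow_id": "w1", "step": 2, "tool": "email.send", "tokens": 900},
    {"workflow_id": "w2", "step": 1, "tool": "crm.search", "tokens": 1100},
]
report = audit_tool_usage(bound, log)
print(report)
```

`unused` is the cut list, `bound_to_used_ratio` is the savings ceiling described above, and `avg_tokens_per_workflow` is the number to track over time.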

You don't need Alibaba's framework. You need the discipline to stop treating tool binding as a set-and-forget config. The agents that survive contact with production are the ones that carry the least context they can get away with.

Work with BizFlowAI

If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.

Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.


Frequently asked questions

What is Alibaba's SkillWeaver framework?

SkillWeaver is a routing framework from Alibaba's research team that treats an AI agent workflow as an execution graph instead of a flat tool list. It plans the graph up front, with nodes as subtasks and edges as dependencies, then loads only the tools scoped to the current node at runtime. Alibaba reports up to 99% reduction in token consumption on complex, repetitive workflows compared to naive agents that bind every tool into every call.

Why does loading every tool into an LLM agent's context hurt performance?

Loading all tools inflates the system prompt on every call, since each tool schema costs 200-800 tokens and 100 tools can add 20K-80K tokens before the user speaks. It also degrades tool selection accuracy, which by a commonly cited rule of thumb drops noticeably past 15-20 candidate tools. Wrong picks trigger retries with full context, so token costs grow faster than linearly as workflows get longer.

How much can graph-scoped tool routing reduce agent token costs?

Naive agents binding 80 tools across 8 steps spend roughly 256,000 tokens on tool schemas alone. Category-based routing with 15 scoped tools drops this to about 48,000 tokens, and graph-scoped routing with 2-3 tools per node drops it to around 8,000 tokens. That is a 32x reduction, about 97%, before accounting for fewer retry loops. Larger multipliers depend on how retry-heavy the naive baseline is and on how often cached graphs let you skip planning.

How is SkillWeaver related to Model Context Protocol (MCP)?

MCP addresses a closely related problem: servers expose tools through a discovery interface, so clients can list tools at runtime and choose which definitions reach the model rather than hard-coding everything. Claude Code already uses this pattern by activating only relevant MCP servers per session. What most MCP implementations lack is SkillWeaver's graph planning layer, which sequences tool calls up front instead of picking reactively.

When should you not use graph-based agent routing?

Avoid it for open-ended exploratory tasks where the workflow cannot be planned up front, since the planner will produce a bad plan and add a wasted call. Skip it for small tool inventories: at 8 tools the schema overhead is only 3-5K tokens, and the math starts to matter past about 20 tools. Also avoid it when tool definitions change daily, because maintaining a skill registry with descriptors and categories becomes overhead.