Model Routing in Local-First Agent Stacks

You're paying for Claude Opus to summarize a 200-word Slack thread. You're sending a multi-step research task to GPT-4o-mini because it was the default in your orchestration template. Six months in, your token bill is 3x what it should be and your agents still pick the wrong model for the job half the time.
This is the problem Mindstone's Rebel — an agentic AI operating system that launched this week under a Fair Source license — is trying to solve at the infrastructure level. Local-first, free for teams under 100, with model routing baked in. It's worth looking at not because it's the only answer, but because it points at a pattern small teams should be copying: the orchestration layer, not the model, is where you get leverage.
Why per-task model routing matters more than picking "the best" model
The dominant mistake in agent design right now is treating model selection as a one-time decision. You pick Claude Sonnet 4.5 or GPT-5 or Gemini 2.5, you write a system prompt, you ship. Every step of every agent run hits the same endpoint regardless of whether it's classifying an email or refactoring a 4,000-line file.
The honest math: most agent steps are trivial. Classification, routing, JSON extraction, summarization of short text, yes/no decisions — these run fine on small models that cost roughly an order of magnitude less per million tokens than the frontier. Only the hard reasoning steps (planning, code generation, multi-hop research, long-context synthesis) need the expensive endpoint.
If 80% of your agent's steps are trivial and you route them to a small fast model, you cut spend dramatically without touching quality on the work that actually matters. This is the whole game. Everything Rebel and similar OS-layer tools do — model registry, capability metadata, auto-routing, fallback chains — is in service of making that split happen automatically instead of by hand.
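To make that split concrete, here is a back-of-the-envelope sketch. The prices are made-up relative units (frontier = 10, small = 1, matching the order-of-magnitude gap above), and `blended_cost_per_step` is a name invented for this example. It also assumes every step consumes roughly the same number of tokens, which real runs won't.

```python
def blended_cost_per_step(trivial_share: float, small_cost: float, frontier_cost: float) -> float:
    """Average cost per step when trivial steps go to the small model."""
    if not 0.0 <= trivial_share <= 1.0:
        raise ValueError("trivial_share must be between 0 and 1")
    return trivial_share * small_cost + (1.0 - trivial_share) * frontier_cost


# Hypothetical relative prices, not real provider pricing.
FRONTIER, SMALL = 10.0, 1.0

everything_on_frontier = blended_cost_per_step(0.0, SMALL, FRONTIER)
routed = blended_cost_per_step(0.8, SMALL, FRONTIER)
savings = 1.0 - routed / everything_on_frontier

print(f"routed: {routed:.1f} vs unrouted: {everything_on_frontier:.1f} ({savings:.0%} saved)")
```

Under those assumptions the routed run costs 2.8 units per step instead of 10. The remaining spend is almost entirely the 20% of steps that still hit the frontier model, so the trivial share is the number worth measuring first.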
What Rebel actually does (and what "local-first" buys you)
Rebel is an agentic OS — a runtime layer that sits between your agents and the model providers, handling routing, tool calls, memory, and execution. The two design choices that matter for solo operators and small teams:
Local-first execution. The runtime lives on your hardware or in your VPC. Agents call models over the network when they need a cloud LLM, but orchestration, state, logs, and tool calls stay local. For anyone handling client data, medical records, financial information, or anything covered by an NDA, this removes the "is our orchestration vendor reading our prompts" question entirely.
Fair Source license. Free to use, modify, and ship internally for teams under 100. Above that threshold, commercial terms kick in. This is not open source in the OSI sense, but for a 1-10 person operation it's effectively free forever. Check Mindstone's current license page before you build production dependencies on the licensing terms.
The "local-first" framing matters more than it sounds. Most agent platforms — n8n cloud, Zapier, Make, Lindy — store your workflow state and intermediate data on their servers. When the IRS asks where the data lives, when a healthcare client asks for a BAA, when an enterprise prospect runs you through a SOC 2 questionnaire, "our orchestration vendor has it" is a friction point. "It runs on our own infrastructure" is not.
A practical model routing taxonomy
Before you wire up any router — Rebel's or your own — you need a mental model of which tasks go where. Here's the taxonomy I use with clients:
| Task class | Examples | Model tier | Rough cost class |
|---|---|---|---|
| Classification / routing | Intent detection, tag this email, is this a refund request | Small (Haiku, Gemini Flash-Lite, GPT-class mini) | Cheapest |
| Extraction | Pull line items from invoice, parse a resume to JSON | Small with structured output | Cheapest |
| Short summarization | Summarize a Slack thread, daily standup digest | Small to mid | Low |
| Drafting | First-pass email reply, marketing copy variant | Mid (Sonnet-class, GPT-class standard) | Medium |
| Multi-step planning | Decompose a research goal, plan a refactor | Frontier reasoning (Opus, GPT-5, Gemini Pro) | High |
| Code generation (non-trivial) | Implement a feature across multiple files | Frontier reasoning | High |
| Long-context synthesis | Read 50 docs, produce a memo | Frontier with large context | High |
The router's job is to look at the agent step, classify it against this taxonomy, and dispatch. The classification itself runs on a small model — meta, but cheap.
A minimal router you can build in a day
You don't need Rebel to start doing this. Here's a stripped-down router pattern in Python, the kind of thing I drop into client codebases before we commit to any specific orchestration platform:

    from dataclasses import dataclass
    from typing import Literal

    TaskClass = Literal["classify", "extract", "summarize", "draft", "plan", "code", "synthesize"]

    @dataclass
    class ModelChoice:
        provider: str
        model: str
        max_tokens: int  # generation limit for this task class

    # Model names are placeholders; substitute the exact IDs your providers expose.
    LARGE_CONTEXT = ModelChoice("openai", "gpt-class-pro", 16384)

    ROUTING_TABLE: dict[TaskClass, ModelChoice] = {
        "classify": ModelChoice("anthropic", "claude-haiku", 1024),
        "extract": ModelChoice("anthropic", "claude-haiku", 2048),
        "summarize": ModelChoice("anthropic", "claude-haiku", 2048),
        "draft": ModelChoice("anthropic", "claude-sonnet", 4096),
        "plan": ModelChoice("anthropic", "claude-opus", 8192),
        "code": ModelChoice("anthropic", "claude-opus", 8192),
        "synthesize": LARGE_CONTEXT,
    }

    def route(task_class: TaskClass, context_tokens: int) -> ModelChoice:
        # Escalate to the large-context model when the input is heavy,
        # regardless of task class
        if context_tokens > 100_000:
            return LARGE_CONTEXT
        return ROUTING_TABLE[task_class]

That's the core. The interesting work happens around it: classifying the task accurately, logging which model handled which step, and adding fallback when a provider rate-limits you.
Here's the classification step — note it runs on the cheapest model in your stack:

    from typing import get_args

    async def classify_task(step_description: str) -> TaskClass:
        # small_model_call stands in for your own async wrapper around the cheapest model
        prompt = f"""Classify this agent step into exactly one category:
    - classify: intent detection, tagging, yes/no decisions
    - extract: pulling structured data from unstructured text
    - summarize: condensing text under 5k tokens
    - draft: writing emails, posts, or short copy
    - plan: decomposing goals into subtasks
    - code: generating or modifying code
    - synthesize: reading large context and producing analysis
    Step: {step_description}
    Return only the category name."""
        response = await small_model_call(prompt, max_tokens=10)
        label = response.strip().lower()
        if label not in get_args(TaskClass):
            # Unexpected output would raise KeyError in route(); err toward
            # the stronger tier instead (a judgment call, tune to taste)
            return "plan"
        return label

You can build the whole thing in 200 lines. The reason platforms like Rebel exist is to give you observability, fallback chains, retries, memory, and a UI on top — not because the routing itself is hard.
Fallback chains, the part everyone forgets
A router that calls one provider per task class will break the first time Anthropic has a partial outage or OpenAI rate-limits you mid-run. Production routing means fallback chains, ordered by preference:

    # routing.yaml
    plan:
      primary: anthropic/claude-opus
      fallbacks:
        - openai/gpt-5
        - google/gemini-2.5-pro
      retry_on: [rate_limit, server_error, timeout]
      max_attempts: 3
    code:
      primary: anthropic/claude-opus
      fallbacks:
        - openai/gpt-5
      retry_on: [rate_limit, server_error]
      max_attempts: 2
    classify:
      primary: anthropic/claude-haiku
      fallbacks:
        - openai/gpt-class-mini
        - google/gemini-flash-lite
      retry_on: [rate_limit, server_error, timeout]
      max_attempts: 4

Two principles to internalize:
- Trivial tasks should have aggressive fallback because losing a classification step kills the whole agent run, and any small model can substitute.
- Reasoning tasks should have conservative fallback because GPT-5's plan looks different from Opus's plan, and switching mid-workflow can produce inconsistent output. Sometimes the right answer is to fail fast and surface an error to the human.
Rebel handles this at the OS level. If you're rolling your own, a 50-line wrapper around your provider SDKs gets you 90% of the way there. The remaining 10% — circuit breakers, exponential backoff with jitter, per-tenant rate budgeting — only matters once you're running real volume.
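For the roll-your-own case, a wrapper along these lines is one way to honor a fallback chain. It is a sketch under assumptions, not Rebel's implementation: `ProviderError`, `call_with_fallback`, and the `call_model` callback are hypothetical names, and real provider SDKs raise their own exception types that you would map onto the `retry_on` kinds. It also assumes that when `max_attempts` exceeds the number of models, the chain wraps around to the primary.

```python
import random
import time
from typing import Callable, Optional


class ProviderError(Exception):
    """Hypothetical error type; `kind` mirrors the retry_on values in routing.yaml."""

    def __init__(self, kind: str):
        super().__init__(kind)
        self.kind = kind


def call_with_fallback(
    chain: list[str],
    call_model: Callable[[str], str],
    retry_on: set[str],
    max_attempts: int,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Walk the chain (primary first) and return (model_used, response)."""
    if not chain or max_attempts < 1:
        raise ValueError("need at least one model and one attempt")
    last_error: Optional[ProviderError] = None
    for attempt in range(max_attempts):
        # Assumption: when attempts outnumber models, wrap around to the primary.
        model = chain[attempt % len(chain)]
        try:
            return model, call_model(model)
        except ProviderError as err:
            if err.kind not in retry_on:
                raise  # not a retryable failure: fail fast
            last_error = err
            if attempt < max_attempts - 1:
                # Exponential backoff with full jitter before the next attempt.
                sleep(random.uniform(0, base_delay_s * 2**attempt))
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


def flaky(model: str) -> str:
    """Fake provider call: the primary is rate-limited, everything else answers."""
    if model == "anthropic/claude-haiku":
        raise ProviderError("rate_limit")
    return f"ok from {model}"


used, reply = call_with_fallback(
    ["anthropic/claude-haiku", "openai/gpt-class-mini", "google/gemini-flash-lite"],
    flaky,
    retry_on={"rate_limit", "server_error", "timeout"},
    max_attempts=4,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(used, "->", reply)
```

Returning the model that actually answered alongside the response is deliberate: it is what lets you log whether a fallback fired for that step.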
Memory: which model "remembers" the routing decision
Here's where things get interesting and where Rebel's framing as an "OS" starts to earn the name. Naive routing classifies each step in isolation. Smarter routing remembers: this user's invoices always parse cleanly with the small model, this client's contracts need the large-context model, this type of support ticket consistently needs escalation.
The pattern I use:
# Pseudocode for routing memory
async def route_with_memory(task: AgentStep, tenant_id: str) -> ModelChoice:
# Check if we've seen this task signature before
history = await routing_store.sessions_history(
tenant_id=tenant_id,
task_signature=task.signature(),
limit=10
)
if history and history.success_rate > 0.95:
# Trust the historical choice
return history.most_used_model
# No history or unreliable — classify fresh
task_class = await classify_task(task.description)
choice = route(task_class, task.context_tokens)
# Log the choice for future memory
await routing_store.record(tenant_id, task.signature(), choice)
return choice
This is where per-tenant or per-task model preferences emerge automatically. You don't write rules; the system learns them from outcomes. Over a few hundred runs, your agent gets cheaper and more reliable on its own.
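The store behind that pattern can start out very small. The sketch below is a hypothetical in-memory version: `InMemoryRoutingStore`, `History`, and `record_outcome` are names invented here, not part of Rebel or any library. It keeps the last N outcomes per (tenant, signature) pair and derives the success rate and most-used model from them. A real deployment would presumably persist this to a database.

```python
from collections import Counter, defaultdict, deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class History:
    success_rate: float
    most_used_model: str
    samples: int


class InMemoryRoutingStore:
    """Hypothetical routing memory: last `window` outcomes per (tenant, signature)."""

    def __init__(self, window: int = 10):
        self._runs = defaultdict(lambda: deque(maxlen=window))

    def record_outcome(self, tenant_id: str, signature: str, model: str, success: bool) -> None:
        # Oldest entries fall off automatically once the window is full.
        self._runs[(tenant_id, signature)].append((model, success))

    def history(self, tenant_id: str, signature: str) -> Optional[History]:
        runs = self._runs.get((tenant_id, signature))
        if not runs:
            return None
        models = Counter(model for model, _ in runs)
        successes = sum(1 for _, ok in runs if ok)
        return History(successes / len(runs), models.most_common(1)[0][0], len(runs))


store = InMemoryRoutingStore(window=10)
for _ in range(9):
    store.record_outcome("tenant-a", "sig-1", "claude-haiku", True)
store.record_outcome("tenant-a", "sig-1", "claude-sonnet", False)

seen = store.history("tenant-a", "sig-1")
print(seen)
```

With nine successes out of ten, the 0.95 threshold from the pseudocode is not met, so the router would classify fresh. Exposing `samples` lets you also require a minimum number of runs before trusting history, since a single lucky success otherwise reads as a 100% success rate.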
The "task signature" can be as simple as a hash of (agent_name, step_type, input_schema). You don't need vector similarity for this — exact match on a structured signature works fine for 95% of small-team use cases.
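A signature like that takes a few lines with the standard library. In this sketch, `task_signature` and the example schema are illustrative; the only real requirement is that the serialization is deterministic, so the same inputs always hash to the same value.

```python
import hashlib
import json


def task_signature(agent_name: str, step_type: str, input_schema: dict) -> str:
    """Stable short hash of (agent_name, step_type, input_schema)."""
    payload = json.dumps(
        {"agent": agent_name, "step": step_type, "schema": input_schema},
        sort_keys=True,  # key order must not change the hash
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


sig_a = task_signature("invoice-agent", "extract", {"vendor": "str", "total": "float"})
sig_b = task_signature("invoice-agent", "extract", {"total": "float", "vendor": "str"})
sig_c = task_signature("invoice-agent", "summarize", {"vendor": "str", "total": "float"})

print(sig_a == sig_b, sig_a == sig_c)
```

Hashing the schema rather than the input values is what keeps the signature stable across runs: two invoices with different amounts share a signature, while a change in the step type or the shape of the input produces a new one.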
Observability: the metric that matters
If you take one thing from this post: measure cost-per-successful-run, not cost-per-token.
Token cost is a vanity metric. A small model that fails and forces a retry on a large model costs more than just using the large model upfront. A large model that succeeds first time can be cheaper end-to-end than a chain of small-model attempts.
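The arithmetic is simple enough to sketch. The numbers below are invented to illustrate the point, not measurements, and `cost_per_successful_run` is a name made up for this example. The key detail is that failed runs still count toward the cost.

```python
def cost_per_successful_run(runs: list[tuple[float, bool]]) -> float:
    """runs holds (total_cost_usd, succeeded) pairs; failed runs still cost money."""
    total_cost = sum(cost for cost, _ in runs)
    successes = sum(1 for _, ok in runs if ok)
    if successes == 0:
        return float("inf")  # all spend, no successful outcomes
    return total_cost / successes


# Hypothetical data: the small model is half the price per run but succeeds less often.
small_only = [(0.01, True)] * 40 + [(0.01, False)] * 60
large_only = [(0.02, True)] * 95 + [(0.02, False)] * 5

small_cps = cost_per_successful_run(small_only)
large_cps = cost_per_successful_run(large_only)
print(f"small: ${small_cps:.4f} per success, large: ${large_cps:.4f} per success")
```

In this made-up case the model that is half the price per run comes out more expensive per successful outcome, which is exactly what a per-token view hides.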
The minimum observability surface:
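
    from dataclasses import dataclass
    from datetime import datetime
    from typing import Literal

    # StepMetric comes first so RunMetrics can reference it;
    # TaskClass is the Literal defined with the router.
    @dataclass
    class StepMetric:
        step_class: TaskClass
        model_used: str
        fallback_triggered: bool
        input_tokens: int
        output_tokens: int
        latency_ms: int
        cost_usd: float
        success: bool

    @dataclass
    class RunMetrics:
        run_id: str
        agent: str
        started_at: datetime
        ended_at: datetime
        steps: list[StepMetric]
        outcome: Literal["success", "failure", "partial"]
        total_cost_usd: float
        total_latency_ms: int
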
Once you have this, you can answer the actual business questions: which agents cost the most per successful outcome, which model choices have the worst ROI, where fallbacks are firing too often. Without it, "we use Claude" or "we route to Haiku for cheap stuff" is a guess.
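As a rough sketch of how those questions turn into numbers, the aggregation below groups runs by agent. `RunRecord` is a trimmed, hypothetical stand-in for the run metrics with step counts already summed, `report_by_agent` is an invented name, and the sample data is made up. It treats a "partial" outcome as not successful, which is an assumption you may want to revisit.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Trimmed, hypothetical stand-in for run metrics with steps pre-counted."""
    agent: str
    outcome: str  # "success", "failure", or "partial"
    total_cost_usd: float
    total_steps: int
    fallback_steps: int  # steps where a fallback was triggered


def report_by_agent(runs: list[RunRecord]) -> dict[str, dict[str, float]]:
    totals = defaultdict(lambda: {"cost": 0.0, "successes": 0, "steps": 0, "fallbacks": 0})
    for run in runs:
        t = totals[run.agent]
        t["cost"] += run.total_cost_usd
        t["successes"] += run.outcome == "success"  # partial counts as not successful
        t["steps"] += run.total_steps
        t["fallbacks"] += run.fallback_steps
    report = {}
    for agent, t in totals.items():
        report[agent] = {
            "cost_per_success": t["cost"] / t["successes"] if t["successes"] else float("inf"),
            "fallback_rate": t["fallbacks"] / t["steps"] if t["steps"] else 0.0,
        }
    return report


runs = [
    RunRecord("research", "success", 0.50, 10, 1),
    RunRecord("research", "failure", 0.30, 10, 3),
    RunRecord("triage", "success", 0.02, 4, 0),
    RunRecord("triage", "success", 0.02, 4, 0),
]
report = report_by_agent(runs)
for agent, stats in report.items():
    print(agent, stats)
```

Sorting that report by `cost_per_success` gives the list of agents to look at first, and a high `fallback_rate` points at a primary provider or rate budget that needs attention.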
Rebel ships observability primitives out of the box; if you build your own, expect to spend a day or two on the metrics pipeline before you spend any time optimizing routes.
How BizFlowAI approaches this
I build orchestration layers for solopreneurs and small teams who need exactly what Rebel offers — local-first execution, per-task model routing, fallback chains, and per-tenant memory — but with the workflows wired to their specific business. For clients in healthcare, legal, and finance, that usually means running the runtime in their own AWS account or on a dedicated server, with model calls going out to a curated list of providers and everything else staying on their infrastructure. The router is the same pattern across clients; what differs is the task taxonomy and the fallback policy.
If you're at the stage where your token bill is climbing and you can't explain why, or you're being asked SOC 2 questions you can't answer because your orchestration vendor holds the state, that's the conversation to have. Rebel is a solid choice if you want to self-serve; if you'd rather have the routing layer built and handed to you wired into your existing stack, that's what I do.
What to do this week
Whether you adopt Rebel, build your own router, or use whatever's already in your stack, three concrete moves:
- Audit your current agent runs. For each step, ask: does this task class actually need the model it's calling? Most teams find 50-70% of steps can drop to a cheaper tier without quality loss.
- Build the cheapest possible router first. A dict mapping task classes to models is enough to start. Don't reach for a platform until you understand your own routing needs.
- Instrument before you optimize. Cost-per-successful-run, per agent, per task class. Without this number, every routing decision is opinion.
The shift happening at the OS layer — Rebel, similar agent operating systems, the model gateways from major clouds — is good for small teams. It pushes the smart-routing capability that used to require a platform engineering team down to where a solo operator can use it. The teams that pick it up early will be running agents that cost a fraction of their competitors' and degrade gracefully when any single provider has a bad day. The teams that don't will keep paying frontier prices to classify emails.
Frequently asked questions
What is model routing in AI agent systems?
Model routing is the practice of automatically dispatching each step of an AI agent to a different LLM based on the task's complexity. Trivial steps like classification or extraction go to cheap small models (Haiku, Gemini Flash-Lite), while planning, code generation, and long-context synthesis go to frontier models (Opus, GPT-5, Gemini Pro). This typically cuts token costs significantly because most agent steps are trivial and don't need a frontier model. The routing decision itself usually runs on a cheap small model.
What is Mindstone Rebel and how does it work?
Mindstone Rebel is a newly launched agentic AI operating system released under a Fair Source license. It runs locally on your hardware or VPC, handling agent orchestration, model routing, tool calls, memory, and execution between your agents and cloud LLM providers. It's free for teams under 100 people, with commercial terms above that threshold. The local-first design keeps workflow state, logs, and intermediate data on your infrastructure rather than a vendor's servers.
How do I reduce LLM API costs in production agent workflows?
Route each agent step to the cheapest model that can handle it instead of sending everything to a frontier model. Classify the task type (classification, extraction, summarization, drafting, planning, code, synthesis) using a cheap small model, then dispatch to the appropriate tier. Add fallback chains so rate limits or outages don't break runs. A basic router in Python takes about 200 lines and, when most of your steps are trivial, can cut the bill to roughly a third or less.
What is a Fair Source license?
Fair Source is a category of source-available licensing: the code is public and you can use and modify it, subject to limits that protect the vendor's business. It is not OSI-approved open source. In Mindstone Rebel's case, the terms described here allow free use, modification, and internal deployment for teams under 100, with commercial terms required above that threshold, so it is effectively free for solo operators and small teams. Always check the current license terms before building production dependencies.
How should I design LLM fallback chains for production?
Trivial tasks like classification should have aggressive fallback across multiple providers because any small model can substitute and losing the step breaks the whole run. Reasoning tasks like planning or code generation should have conservative fallback because different frontier models produce inconsistent output, and sometimes failing fast with a human-facing error is correct. Specify primary model, ordered fallbacks, retry conditions (rate_limit, server_error, timeout), and max attempts per task class. Add circuit breakers and exponential backoff once you hit real volume.