Companies Are Slashing AI Budgets. Builders, This Is Your

The Financial Times reported it last week, and Hacker News piled on with a hundred-plus upvotes: enterprises are pulling back on AI spend because the unit economics stopped working. If your OpenAI invoice quietly passed your AWS bill last month, you're not crazy and you're not alone. What follows is the actual breakdown of why this is happening and the routing-layer playbook I've used on three client systems this quarter to cut bills 70–85% without touching output quality.
What's actually dying isn't AI. It's the lazy implementation.
The pattern is depressingly consistent. A team gets excited about GPT-4 class capability, wires it into every workflow that touches text, ships to production, and discovers six weeks later that they're burning $38k/month on something that replaces $15k of labor. The board meeting gets quiet. Then someone says "maybe we pause this."
Here's what I see in those systems when I audit them:
- One model for every task. Trivial classification, complex reasoning, JSON reformatting — all hitting the same expensive endpoint.
- Zero caching. The same prompt with the same input fires fresh every time, even when the answer can't possibly change.
- No batching. 10,000 single-row calls instead of 100 batched calls.
- No cost-per-task logging. Nobody can tell you which workflow is eating 60% of the bill, because nobody measured.
- No fallback path. When the model is overkill, there's no rules-based shortcut to bypass it.
The cuts aren't a verdict on AI. They're a verdict on engineering teams that treated LLM calls like free function calls. The market is correcting that. If you can show up and rebuild the same output for a tenth of the cost, you're not selling AI — you're selling sanity, and right now there are CFOs writing checks for exactly that.
Step one: instrument before you optimize anything
You cannot fix what you cannot see. Before you swap a single model or add a single cache, log every call. Model, input tokens, output tokens, latency, task label, success/failure. Run it for a week. Then sort by total cost descending.
Here's the minimal logger I drop into client codebases on day one:
import time, json, sqlite3
from openai import OpenAI
client = OpenAI()
db = sqlite3.connect("llm_costs.db")
db.execute("""
CREATE TABLE IF NOT EXISTS calls (
ts REAL, task TEXT, model TEXT,
in_tok INT, out_tok INT, cost_usd REAL, ms INT
)
""")
# Per-1M token pricing — update as providers change it
PRICE = {
"gpt-4o": (2.50, 10.00),
"gpt-4o-mini": (0.15, 0.60),
"claude-3-5-sonnet":(3.00, 15.00),
"claude-3-5-haiku": (0.80, 4.00),
}
def call(model, messages, task):
t0 = time.time()
r = client.chat.completions.create(model=model, messages=messages)
ms = int((time.time() - t0) * 1000)
u = r.usage
pin, pout = PRICE[model]
cost = (u.prompt_tokens * pin + u.completion_tokens * pout) / 1_000_000
db.execute("INSERT INTO calls VALUES (?,?,?,?,?,?,?)",
(time.time(), task, model, u.prompt_tokens, u.completion_tokens, cost, ms))
db.commit()
return r.choices[0].message.content
After seven days, run one query:
SELECT task, model, COUNT(*) AS n,
ROUND(SUM(cost_usd), 2) AS spend,
ROUND(AVG(cost_usd), 4) AS per_call
FROM calls
GROUP BY task, model
ORDER BY spend DESC;
Every audit I've done lands on the same shape: 70–80% of spend concentrated in 10% of call types. The top three rows are where all the money goes. Those are also, almost without exception, the easiest calls to downgrade.
Step two: route by task difficulty, not by habit
The single architectural change that delivers 5–10x cost reduction is a router. One function in front of your LLM layer that decides which model handles which call. Most teams skip this because it feels too simple to matter. It matters more than anything else you'll do.
Concrete cost gap to keep in mind:
| Task | Wrong choice | Right choice | Cost ratio |
|---|---|---|---|
| Gmail label classification | Claude Sonnet 3.5 | Haiku or local Llama 3.1 8B | ~20x |
| Invoice line-item extraction | GPT-4o | gpt-4o-mini + JSON mode | ~17x |
| "Is this email spam?" | Any LLM | regex + sender allowlist | ∞ |
| Multi-step contract review | gpt-4o-mini | GPT-4o or Sonnet | use the big one |
| Summarize 200-word ticket | GPT-4o | Haiku / Mistral 7B | ~15x |
The router itself is unglamorous code:
def route(task: str, payload: dict) -> str:
# 1. Rules first — no LLM needed
if task == "spam_check":
if payload["sender"] in ALLOWLIST: return "ham"
if any(k in payload["subject"].lower() for k in SPAM_KW): return "spam"
# 2. Cache hit?
key = hash_payload(task, payload)
if cached := cache.get(key):
return cached
# 3. Tiered model selection
if task in CHEAP_TASKS: model = "gpt-4o-mini"
elif task in MEDIUM_TASKS: model = "claude-3-5-haiku"
elif task in HARD_TASKS: model = "gpt-4o"
else: model = "gpt-4o-mini" # default cheap
out = call(model, build_prompt(task, payload), task)
cache.set(key, out, ttl=86400)
return out
That's it. Three tiers, a rules shortcut, and a cache. On a Gmail triage system I rebuilt for a 6-person client last quarter, this dropped the monthly bill from $2,840 to $310. Same triage accuracy measured on a 500-email holdout set.
Where the savings actually come from
- Rules layer absorbing 30–50% of trivial calls before they hit any model
- Cheap model handling another 40% with no quality drop
- Cache eliminating duplicate work on recurring inputs (newsletters, automated alerts, recurring vendor emails)
- Expensive model reserved for the 10–15% of calls that actually need reasoning
Step three: cache aggressively and batch everything you can
Caching is the lowest-effort, highest-return move in the playbook, and almost nobody does it properly. If your system answers the same question twice with the same input, the second call should cost zero. Period.
Two layers worth setting up:
# Exact-match cache — for deterministic prompts
import hashlib, redis
r = redis.Redis()
def cache_key(model, messages, temp):
blob = json.dumps([model, messages, temp], sort_keys=True)
return "llm:" + hashlib.sha256(blob.encode()).hexdigest()
# Semantic cache — for "close enough" prompts (FAQ, support)
# Embed the query, look up nearest neighbor in a vector store,
# return cached answer if cosine similarity > 0.95
For batching: the OpenAI Batch API and Anthropic's Message Batches API are both ~50% cheaper than synchronous calls, with a 24-hour SLA. Anything that doesn't need a real-time answer — overnight enrichment, nightly summary generation, bulk classification — belongs in a batch job. I have one client running 180k product description rewrites per month through batch endpoints. Sync cost would be $1,460. Batch cost is $730. Same output.
Quick cache wins I look for first
- Recurring vendor emails (Stripe receipts, GitHub notifications, SaaS billing) — same template every time
- FAQ-style support questions — semantic cache catches paraphrases
- Document chunks that get re-summarized across multiple workflows
- Any "extract structured data from this template" task where the template is stable
Step four: push the easy 80% to small or local models
The honest reality of late-2024 small models: gpt-4o-mini, Claude Haiku, Llama 3.1 8B, Qwen 2.5 7B, and Mistral Small all handle classification, extraction, summarization, and rewriting at quality indistinguishable from GPT-4 class on those specific tasks. The gap only opens up on multi-step reasoning, long-context synthesis, and code generation.
If you have a home server or even a decent workstation, local inference gets your marginal cost to near-zero. On my PC-PC box (WSL Ubuntu, single consumer GPU), I run Qwen 2.5 7B via Ollama for the bulk of my UNA_Intel Gmail triage:
ollama pull qwen2.5:7b
ollama run qwen2.5:7b
import requests
def local_classify(text):
r = requests.post("http://localhost:11434/api/generate", json={
"model": "qwen2.5:7b",
"prompt": f"Classify this email as: urgent, normal, spam, newsletter.\n\n{text}\n\nLabel:",
"stream": False,
"options": {"temperature": 0}
})
return r.json()["response"].strip()
That call costs me electricity. Roughly $0.0001 amortized. Same call to GPT-4o is $0.003–0.015 depending on input size. At 5,000 calls/day that's the difference between $0.50/month and $450–2,250/month on a single workflow.
You don't need local for everything. You need it for the high-volume, low-complexity tail.
Step five: the routing layer is the moat for the next 18 months
Here's the take I'd bet money on: the next year and a half won't be won by whoever has the smartest model. Everyone will have access to comparable frontier models — the gaps are already shrinking quarter over quarter. What will separate working systems from money pits is the routing layer. Task-aware model selection, exact and semantic caching, batching, rules shortcuts, and small models doing the boring 80%.
The companies cutting AI budgets aren't AI pessimists. They're the leading edge of a market that's done paying for vibes and is now demanding engineering. That's a good market to be a builder in. The sales pitch writes itself: I'll cut your AI bill by 80% and keep the output quality. I've sent that exact sentence in three cold emails this quarter. Three replies, three audits, three contracts.
The playbook is boring. It works. Instrument, sort, route, cache, batch, downgrade. There's nothing clever in any of those steps individually. The clever part is that almost nobody does them in combination, which is why your inbox of potential clients is wider open than it's been in two years.
Why bizflowai.io helps with this
This is exactly the kind of work we package for clients at bizflowai.io. We come in, instrument your existing AI workflows for a week, produce a cost-per-task breakdown, and rebuild the highest-spend call paths with a routing layer — tiered models, semantic cache, batch jobs where latency allows, and local inference for high-volume trivial tasks. Most engagements land between 70% and 85% cost reduction at parity output quality, measured on a holdout set the client picks themselves so the numbers aren't gameable. No retainers, no vibes, just the bill before and the bill after.
Frequently asked questions
Why are enterprises cutting back on AI rollouts in 2024?
Enterprises are pulling back on AI rollouts because the unit economics of production-scale deployments aren't working. Pilot projects looked cheap, but running GPT-4 class models on every workflow request produces monthly invoices rivaling a junior engineer's salary. According to an FT report highlighted on Hacker News, teams are cutting bad AI implementations, not AI itself, and actively seeking engineers who can rebuild workflows at a fraction of the cost.
How do I reduce my company's AI API costs?
Instrument your AI system by logging every call: model used, input tokens, output tokens, and task type. Run it for a week, then sort by cost. Typically 70-80% of spend goes to about 10% of call types, which can usually be downgraded to a smaller model, a cached response, or a rules-based check. Routing hard tasks to expensive models and everything else to cheap or local models often yields a 5-10x cost reduction without quality loss.
What is an AI routing layer and why does it matter?
An AI routing layer directs each task to the most cost-appropriate model rather than sending every request to one expensive frontier model. It combines model selection per task, caching, batching, and small models handling roughly 80% of work. This architecture matters because the next 18 months of AI competition won't be won by access to the smartest model, but by whoever builds the best routing layer as their moat.
When should I use a small model vs a frontier model like GPT-4 or Claude Opus?
Use frontier models like GPT-4o or Claude Opus only for genuinely hard tasks requiring advanced reasoning. Route everything else—Gmail triage, invoice line-item extraction, simple classification, repetitive checks—to smaller, cheaper, or local models, or replace them with cached responses and rules-based logic entirely. The principle: expensive models for the hard 20%, cheap models for the trivial 80%. This single architectural change typically cuts AI bills by 80% with no quality loss.
Is the AI bubble bursting?
No, AI isn't dying—lazy AI implementation is. Companies cutting AI budgets aren't pessimists abandoning the technology; they're an early signal that the market is demanding engineering discipline instead of hype. The opportunity for builders and founders is real: offering to cut a company's AI bill by 80% while maintaining output quality is currently one of the most valuable sales pitches available, because enterprises are actively shopping for that exact solution.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.
Frequently asked questions
Why are enterprises cutting back on AI rollouts in 2024?
Enterprises are pulling back on AI rollouts because the unit economics of production-scale deployments aren't working. Pilot projects looked cheap, but running GPT-4 class models on every workflow request produces monthly invoices rivaling a junior engineer's salary. According to an FT report highlighted on Hacker News, teams are cutting bad AI implementations, not AI itself, and actively seeking engineers who can rebuild workflows at a fraction of the cost.
How do I reduce my company's AI API costs?
Instrument your AI system by logging every call: model used, input tokens, output tokens, and task type. Run it for a week, then sort by cost. Typically 70-80% of spend goes to about 10% of call types, which can usually be downgraded to a smaller model, a cached response, or a rules-based check. Routing hard tasks to expensive models and everything else to cheap or local models often yields a 5-10x cost reduction without quality loss.
What is an AI routing layer and why does it matter?
An AI routing layer directs each task to the most cost-appropriate model rather than sending every request to one expensive frontier model. It combines model selection per task, caching, batching, and small models handling roughly 80% of work. This architecture matters because the next 18 months of AI competition won't be won by access to the smartest model, but by whoever builds the best routing layer as their moat.
When should I use a small model vs a frontier model like GPT-4 or Claude Opus?
Use frontier models like GPT-4o or Claude Opus only for genuinely hard tasks requiring advanced reasoning. Route everything else—Gmail triage, invoice line-item extraction, simple classification, repetitive checks—to smaller, cheaper, or local models, or replace them with cached responses and rules-based logic entirely. The principle: expensive models for the hard 20%, cheap models for the trivial 80%. This single architectural change typically cuts AI bills by 80% with no quality loss.
Is the AI bubble bursting?
No, AI isn't dying—lazy AI implementation is. Companies cutting AI budgets aren't pessimists abandoning the technology; they're an early signal that the market is demanding engineering discipline instead of hype. The opportunity for builders and founders is real: offering to cut a company's AI bill by 80% while maintaining output quality is currently one of the most valuable sales pitches available, because enterprises are actively shopping for that exact solution.