12 AI Agent Examples Actually Running in Production

Most "AI agent" posts you'll find are hypothetical. Someone describes what an agent could do, slaps a LangChain diagram on it, and calls it a day. Meanwhile, you're trying to decide whether to spend a Saturday building one for your business — and you have no idea which use cases actually pay off versus which ones become weekend projects that die in a half-finished GitHub repo.
This post is the opposite. Twelve agent patterns I've either built, deployed for clients, or seen running profitably in small businesses. For each one: what it does, what it actually costs to run, whether you should build or buy, and where it breaks.
What counts as an "AI agent" (and what doesn't)
An AI agent is a program that uses an LLM to decide what to do next, then calls tools (APIs, databases, code) to do it — usually in a loop until a goal is met. That's it. A chatbot that answers questions is not an agent. A script that summarizes one email is not an agent. A system that reads an email, classifies it, drafts a reply, checks your calendar, and books a meeting is an agent.
The distinction matters because agents are dramatically more expensive to run and harder to debug than single-shot LLM calls. A 2024 Anthropic engineering post puts it bluntly: most production "agent" systems should actually be workflows with fixed steps, not autonomous loops. Reach for an agent when the path is unpredictable; reach for a workflow when it isn't.
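The difference is easier to see in code. Below is a minimal sketch, not a production design: the two tools and `fake_llm_decide` are hypothetical stand-ins (a real agent would call an LLM to pick the next action). The workflow runs fixed steps in a fixed order, while the agent asks the model what to do next and loops until the model says it is done or a step cap is hit.

```python
from typing import Callable

# Hypothetical tools: each takes the shared state dict and returns an updated copy.
def classify_email(state: dict) -> dict:
    return {**state, "category": "meeting_request"}

def draft_reply(state: dict) -> dict:
    return {**state, "draft": f"Re: {state['subject']}"}

TOOLS: dict[str, Callable[[dict], dict]] = {
    "classify_email": classify_email,
    "draft_reply": draft_reply,
}

def fake_llm_decide(state: dict) -> str:
    """Stand-in for the LLM call that picks the next action from the current state."""
    if "category" not in state:
        return "classify_email"
    if "draft" not in state:
        return "draft_reply"
    return "done"

def run_workflow(state: dict) -> dict:
    """Workflow: the steps and their order are fixed in code."""
    for step in (classify_email, draft_reply):
        state = step(state)
    return state

def run_agent(state: dict, decide: Callable[[dict], str], max_steps: int = 10) -> dict:
    """Agent: the model chooses each next step; the loop is capped to bound cost."""
    for _ in range(max_steps):
        action = decide(state)
        if action == "done":
            return state
        if action not in TOOLS:
            raise ValueError(f"model requested unknown tool: {action}")
        state = TOOLS[action](state)
    raise RuntimeError(f"no result after {max_steps} steps")

email = {"subject": "Can we meet Thursday?"}
assert run_agent(email, fake_llm_decide) == run_workflow(email)
```

When the path is this predictable, both versions produce the same result and the workflow does it without paying for a decision call at every step.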
Quick decision rule:
| Signal | Use workflow | Use agent |
|---|---|---|
| Input shape | Predictable | Highly variable |
| Steps known in advance | Yes | No |
| Tolerance for wrong action | Low | Medium |
| Cost per run sensitivity | High | Lower |
Now the twelve examples.
1-3: Customer support agents
Customer support is where agents earn their keep first. The work is text-heavy, repetitive, and high-volume — the exact shape LLMs handle well.
1. Tier-1 triage and reply drafter. Reads inbound tickets, classifies them (refund / bug / how-to / billing), pulls the relevant customer record, drafts a reply, and sends it to a human queue. The human approves or edits, then ships. This is the highest-ROI agent for almost any SMB with >50 tickets/week. Build cost: a weekend. Running cost: a few cents per ticket on Claude Haiku or GPT-4o-mini.
2. Documentation search agent. Indexes your help docs, runbooks, and past tickets, then answers customer questions inline. The non-obvious part: chunking matters more than the model. Per-page chunking on PDFs and section-level chunking on Markdown beat fancy embedding strategies for most SMB-sized corpora (under ~10k documents).
3. Refund / policy enforcement agent. Reads a refund request, checks the order against your refund policy (encoded as rules or as a prompt), then either auto-approves, escalates, or denies with an explanation. Build vs. buy: buy for standard ecommerce (Gorgias and Intercom Fin both do this). Build if your policy is custom or you're not on a supported platform.
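For the documentation search agent (pattern 2), section-level chunking on Markdown can be sketched in a few lines. This is an illustration rather than a production splitter: the function name and the chunk shape (`title`, `text`) are assumptions made for the example, and it does not handle `#` lines that appear inside fenced code blocks.

```python
import re

# Matches ATX-style Markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^#{1,6}\s+(.*)$")

def chunk_markdown_by_section(text: str) -> list[dict]:
    """Split a Markdown document into one chunk per heading-delimited section."""
    chunks = []
    title, lines = "(preamble)", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section, skipping sections with no body text.
            if any(l.strip() for l in lines):
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(1).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks

doc = "Intro line\n# Refunds\nRefunds take 5 days.\n## Exceptions\nNone.\n"
for chunk in chunk_markdown_by_section(doc):
    print(chunk["title"], "->", chunk["text"])
```

Keeping the heading as the chunk title gives the retriever a cheap, human-readable label to return alongside the answer.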
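```python
# Minimal triage loop - the core pattern is this small.
# `llm`, `db`, CATEGORIES, and POLICY are placeholders for your own client, store, and config.
def triage(ticket):
    # Assumes classify() returns the label plus a confidence score in [0, 1].
    category, confidence = llm.classify(ticket.body, labels=CATEGORIES)
    context = db.lookup_customer(ticket.email)
    draft = llm.draft_reply(ticket, category, context, policy=POLICY)
    # Once auto-send is on, low-confidence tickets still go to a human.
    return {"category": category, "draft": draft, "needs_human": confidence < 0.85}
```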
The mistake I see most often: skipping the human-in-the-loop step on day one. Don't. Run for 2-4 weeks with humans approving every reply, log the edits, then use those edits to refine the prompt before going auto-send.
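For the refund agent (pattern 3), "policy encoded as rules" might look like the sketch below. The 30-day window, the $100 auto-approve limit, and every field name are hypothetical values chosen for the example; substitute your own policy. The point is that the three outcomes (approve, escalate, deny) come from plain code, and each one carries an explanation the agent can put in the reply.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RefundRequest:
    order_total: float
    purchase_date: date
    request_date: date
    item_opened: bool

# Hypothetical policy values for illustration, not from any real store.
REFUND_WINDOW_DAYS = 30
AUTO_APPROVE_LIMIT = 100.00

def decide_refund(req: RefundRequest) -> tuple[str, str]:
    """Return (decision, explanation) where decision is approve, escalate, or deny."""
    days = (req.request_date - req.purchase_date).days
    if days > REFUND_WINDOW_DAYS:
        return "deny", f"Requested {days} days after purchase; the window is {REFUND_WINDOW_DAYS} days."
    if req.item_opened or req.order_total > AUTO_APPROVE_LIMIT:
        return "escalate", "Inside the window, but needs a human decision."
    return "approve", "Inside the window and under the auto-approve limit."

request = RefundRequest(49.0, date(2024, 3, 1), date(2024, 3, 10), item_opened=False)
print(decide_refund(request))
```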
4-6: Sales and prospecting agents
Sales agents are where most people get burned. The temptation is to build an agent that "does outbound for you," which usually means generating spam that hurts your domain reputation. The agents that actually work are narrower.
4. Lead enrichment agent. Takes a name + company, finds the LinkedIn profile, pulls recent funding news, checks the tech stack on BuiltWith, and summarizes everything into a one-paragraph brief your AE reads before a call. Saves 10-15 minutes per call. Stack: Clay or Apollo for data + Claude/GPT for summarization. Build cost: a day. Buy alternative: Clay ($149+/mo) does this out of the box.
5. Meeting prep + follow-up agent. Listens to a call (via Fireflies/Otter), extracts action items, drafts the follow-up email, creates CRM tasks. Buy: Gong, Clari, Fathom. Don't build this unless you have a very specific reason — the buy options are mature and cheap.
6. Personalized outbound drafter (not sender). Reads a prospect's recent LinkedIn posts, their company blog, and a trigger event, then writes a first draft of an outbound email. A human always reviews. The agent does the research; the human owns the send. This is the only outbound agent pattern I've seen consistently produce reply rates above generic templates without nuking sender reputation.
According to HubSpot's 2024 State of Marketing report, marketers using AI for content creation report saving 2.5 hours per day on average — almost all of that on first drafts, not finished output. Treat your sales agent the same way.
7-9: Data and ops agents
This is the category where SMBs leave the most money on the table, because the work is boring and the wins are invisible until you look at the timesheet.
7. Invoice + receipt extraction agent. Pulls PDFs and images from an inbox, extracts line items, vendor, date, amount, and pushes them into QuickBooks or a Google Sheet. The catch I documented in a previous post: sending PDFs as images burns money fast. Per-page text extraction is roughly 6x cheaper at similar accuracy.
```yaml
# A real config pattern from a client setup
pipeline:
  trigger: gmail.label("invoices")
  steps:
    - extract: pdf_to_text_per_page
    - classify: invoice | receipt | other
    - parse: {vendor, date, total, line_items}
    - validate: total == sum(line_items)
    - route:
        if validate_fail: human_review_queue
        else: quickbooks.create_bill
```
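The `validate` and `route` steps in that config are where a bad extraction gets caught, so they are worth spelling out. Here is a minimal Python sketch, assuming the parsed invoice is a dict with a `total` and a list of `line_items` that each carry an `amount`; those field names and the one-cent tolerance are assumptions, not part of the client setup. It uses `Decimal` because summing currency as floats can miss by a fraction of a cent. Invoices that list tax or shipping outside the line items would need those amounts added to the sum.

```python
from decimal import Decimal

def validate_invoice(parsed: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the line items add up to the stated total, within a rounding tolerance."""
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in parsed["line_items"]),
        Decimal("0"),
    )
    return abs(Decimal(str(parsed["total"])) - line_sum) <= tolerance

def route(parsed: dict) -> str:
    """Mirror the config's route step: failed validation goes to a human."""
    return "quickbooks.create_bill" if validate_invoice(parsed) else "human_review_queue"

invoice = {"total": "30.30", "line_items": [{"amount": "10.10"}, {"amount": "20.20"}]}
print(route(invoice))
```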
8. CRM hygiene agent. Scans your CRM weekly for duplicate contacts, missing fields, stale opportunities, and broken email syntax. Auto-fixes the safe stuff, flags the rest. Build cost: a couple of days if your CRM has a decent API (HubSpot, Salesforce, Pipedrive all qualify). This one's almost always build, not buy — the off-the-shelf tools either cost enterprise money or don't fit your specific CRM hygiene rules.
9. Reporting agent. Pulls numbers from Stripe, your ad platforms, and your CRM every Monday morning, writes a one-page narrative summary, and emails it to you. Not a dashboard — a narrative. "Revenue up 8% WoW, driven by two large invoices from existing customers; new MRR flat; ad spend efficiency dropped on Google due to a new campaign launched Wednesday." This replaces the report nobody reads with the paragraph everybody does.
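For the CRM hygiene agent (pattern 8), two of the checks need no LLM at all. The sketch below finds duplicate contacts by normalized email and flags broken email syntax. The contact shape (`id`, `email`) and the deliberately loose regex are assumptions for the example; a real run would read contacts from your CRM's API and apply your own rules for what counts as safe to auto-fix.

```python
import re
from collections import defaultdict

# Deliberately loose syntax check: one "@", no spaces, a dot in the domain.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def normalize_email(email: str) -> str:
    return email.strip().lower()

def audit_contacts(contacts: list[dict]) -> dict:
    """Group contact ids by normalized email and collect ids with invalid emails."""
    by_email = defaultdict(list)
    invalid = []
    for contact in contacts:
        email = normalize_email(contact.get("email") or "")
        if not EMAIL_RE.match(email):
            invalid.append(contact["id"])
            continue
        by_email[email].append(contact["id"])
    duplicates = {email: ids for email, ids in by_email.items() if len(ids) > 1}
    return {"duplicates": duplicates, "invalid_email": invalid}

contacts = [
    {"id": 1, "email": "Ana@Example.com"},
    {"id": 2, "email": " ana@example.com "},
    {"id": 3, "email": "bob@example"},
]
print(audit_contacts(contacts))
```

Merging the duplicates is the part to flag for review rather than auto-fix, since the two records may hold different notes or owners.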
Build vs. buy for data agents:
| Use case | Buy if... | Build if... |
|---|---|---|
| Invoice extraction | <100 invoices/month, standard formats | Custom formats, >500/month, or unusual ERP |
| CRM hygiene | Salesforce + enterprise budget | Anything else |
| Reporting | You just need dashboards | You want narrative + custom logic |
10-12: Internal IT and engineering agents
These are the agents I personally use most, because they target my own workflow.
10. On-call runbook agent. Listens to alerts (PagerDuty, Sentry), pulls the relevant runbook, executes the safe diagnostic steps (check logs, check dashboards, check recent deploys), and posts a summary to Slack before a human even opens their laptop. Often the human just needs to confirm "yes, restart the worker." This shaves the first 5 minutes off every page, which is the most expensive 5 minutes of your week.
11. Code review agent. Reviews PRs for the boring stuff: unused imports, missing tests, inconsistent error handling, security anti-patterns. Buy options like CodeRabbit and Greptile are mature. Don't build this unless you have very specific in-house conventions a generic tool can't learn from your codebase.
12. Internal helpdesk agent. Answers "how do I get access to X" / "where's the brand kit" / "what's the WiFi password" in Slack. Indexes your Notion/Confluence/Google Drive. The 80/20 here is just good retrieval — the LLM part is almost trivial. According to Atlassian's 2024 State of Teams report, knowledge workers spend roughly a day a week searching for information. Even a mediocre helpdesk agent recovers a chunk of that.
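For the on-call runbook agent (pattern 10), the word "safe" is doing a lot of work. One way to enforce it is an allowlist: the agent may run read-only diagnostics on its own, and anything else is only proposed to the human. The step names and their canned return values below are hypothetical; real steps would query your logs, dashboards, and deploy history.

```python
from typing import Callable

# Hypothetical diagnostic steps with canned results for the example.
def check_recent_deploys() -> str:
    return "last deploy: 42 minutes ago"

def check_error_logs() -> str:
    return "worker: 118 timeout errors in the last 10 minutes"

# Only read-only steps are listed here; the agent can never execute anything else.
READ_ONLY_STEPS: dict[str, Callable[[], str]] = {
    "check_recent_deploys": check_recent_deploys,
    "check_error_logs": check_error_logs,
}

def run_runbook(step_names: list[str]) -> dict:
    """Run allowlisted diagnostics; collect every other step for human confirmation."""
    findings, pending = {}, []
    for name in step_names:
        if name in READ_ONLY_STEPS:
            findings[name] = READ_ONLY_STEPS[name]()
        else:
            pending.append(name)
    return {"findings": findings, "needs_human_confirmation": pending}

summary = run_runbook(["check_recent_deploys", "check_error_logs", "restart_worker"])
print(summary)
```

The summary is what gets posted to Slack: the findings are already done, and `restart_worker` waits for a human to say yes.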
What a healthy agent observability stack looks like:
- Every tool call logged with inputs, outputs, latency, cost
- Every LLM call logged with model, tokens, prompt version
- Daily report: cost per agent run, error rate, human-override rate
- Weekly: sample 20 random runs and grade them
If you skip the observability, you will not know your agent is broken until a customer tells you. I learned this the hard way on a triage agent that started misclassifying refund requests as "general inquiry" after a silent model update.
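The tool-call half of that logging can be a single decorator. The sketch below is one possible shape, not a prescribed stack: `RUN_LOG` stands in for a real log sink, `cost_usd` is a caller-supplied estimate, and `lookup_customer` is a hypothetical tool. LLM-call logging (model, tokens, prompt version) would follow the same pattern with different fields.

```python
import functools
import json
import time

RUN_LOG: list[dict] = []  # stand-in for a real sink (file, database, logging service)

def logged_tool(cost_usd: float = 0.0):
    """Decorator that records inputs, output, error, latency, and cost for each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            record = {
                "tool": fn.__name__,
                "inputs": {"args": args, "kwargs": kwargs},
                "cost_usd": cost_usd,
            }
            try:
                result = fn(*args, **kwargs)
                record["output"], record["error"] = result, None
                return result
            except Exception as exc:
                record["output"], record["error"] = None, repr(exc)
                raise
            finally:
                # Runs on success and failure, so failed calls are logged too.
                record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
                RUN_LOG.append(record)
        return inner
    return wrap

def daily_report(log: list[dict]) -> dict:
    """Aggregate call count, error rate, and total cost from the log."""
    calls = len(log)
    errors = sum(1 for r in log if r["error"])
    return {
        "calls": calls,
        "error_rate": errors / calls if calls else 0.0,
        "total_cost_usd": sum(r["cost_usd"] for r in log),
    }

@logged_tool(cost_usd=0.002)
def lookup_customer(email: str) -> dict:
    return {"email": email, "plan": "pro"}

lookup_customer("ana@example.com")
print(json.dumps(RUN_LOG[-1], default=str))
print(daily_report(RUN_LOG))
```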
Build vs. buy: the actual decision framework
People want a rule. Here's the one that's held up across about 30 client projects:
Buy when: the use case is generic across your industry, the buy option integrates with your existing stack, and you have no proprietary data or logic that gives you an edge. Customer support tools, meeting recorders, code review bots — buy these.
Build when: the agent needs to touch 3+ internal systems with custom logic, your data or process is the moat, or the off-the-shelf tools cost more than $500/month for what is fundamentally a wrapper around an LLM call. CRM hygiene, internal helpdesk, custom invoice flows, narrative reporting — build these.
Don't build when: you don't have a person who can debug a Python stack trace at 11pm. An agent without an owner becomes shadow IT that everyone slowly stops trusting.
A useful gut check from a16z's enterprise AI breakdown: the buy-side agent market is consolidating fast in customer support, sales enablement, and code tooling. If you're building in those categories, you're competing with well-funded products. Ops, finance, and internal knowledge are still wide open for custom work, because every business does them differently.
What breaks in production (and how to catch it)
Every agent on this list will fail in one of these four ways. Plan for it:
- Silent model regressions. Provider updates a model, your prompts start producing slightly different outputs, accuracy drops. Mitigation: pin model versions explicitly, run a daily eval set of 20-50 known inputs, alert on drift.
- Cost runaway. An agent gets into a tool-call loop and burns $50 in an afternoon. Mitigation: hard cap on tool calls per run, alert on cost-per-run anomalies, kill-switch on daily spend.
- Permission creep. Agent gets write access "temporarily" and never gives it back. Mitigation: every agent gets a dedicated service account with minimum scopes, audited quarterly.
- Confidently wrong answers. The classic. The agent makes up an invoice number, a refund policy, or a customer's order history. Mitigation: ground every factual claim in a retrieved document, return citations, refuse when retrieval fails. Never let the agent answer from "knowledge."
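The daily eval mentioned under silent model regressions can start very small. In the sketch below, `keyword_classifier` is a stand-in for the real call to your pinned model, the four-item eval set is a hypothetical sample (the list above suggests 20-50 known inputs), and the 0.05 allowed drop is an assumed threshold. The check compares today's accuracy against a stored baseline and raises an alert flag on drift.

```python
from typing import Callable

# Hypothetical eval set of (input, expected label) pairs.
EVAL_SET = [
    ("I was charged twice this month", "billing"),
    ("I want my money back", "refund"),
    ("The export button crashes the app", "bug"),
    ("How do I add a teammate?", "how-to"),
]

def keyword_classifier(text: str) -> str:
    """Stand-in for the model call; a real check would hit the deployed classifier."""
    lowered = text.lower()
    if "charged" in lowered:
        return "billing"
    if "money back" in lowered:
        return "refund"
    if "crash" in lowered:
        return "bug"
    return "how-to"

def eval_accuracy(classify: Callable[[str], str], eval_set=EVAL_SET) -> float:
    correct = sum(1 for text, expected in eval_set if classify(text) == expected)
    return correct / len(eval_set)

def check_drift(classify: Callable[[str], str], baseline: float, max_drop: float = 0.05) -> dict:
    """Flag an alert when accuracy falls more than max_drop below the baseline."""
    accuracy = eval_accuracy(classify)
    return {"accuracy": accuracy, "alert": accuracy < baseline - max_drop}

print(check_drift(keyword_classifier, baseline=1.0))
```

Run it on a schedule and page someone when `alert` is true, so the regression is caught by the eval set instead of by a customer.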
How BizFlowAI approaches this
Most of the agents on this list are things I've built for clients running 1-10 person teams. The build pattern is consistent: start with the human-in-the-loop version, run for 2-4 weeks, measure where the human actually edits the output, then ratchet down the human involvement only on the cases where the agent is already reliable. The agents that fail are the ones shipped fully autonomous on day one.
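That ratchet can be driven directly by the review log. A minimal sketch, assuming each reviewed reply is logged as a dict with a `category` and an `edited` flag; the 5% edit-rate ceiling and the 50-sample minimum are hypothetical thresholds, not numbers from client work. Only categories with enough reviews and a low enough edit rate become candidates for auto-send.

```python
from collections import defaultdict

def auto_send_categories(
    review_log: list[dict],
    max_edit_rate: float = 0.05,
    min_samples: int = 50,
) -> set[str]:
    """Return categories where humans rarely edit the draft and the sample is large enough."""
    counts = defaultdict(lambda: {"total": 0, "edited": 0})
    for entry in review_log:
        stats = counts[entry["category"]]
        stats["total"] += 1
        stats["edited"] += int(entry["edited"])
    return {
        category
        for category, stats in counts.items()
        if stats["total"] >= min_samples
        and stats["edited"] / stats["total"] <= max_edit_rate
    }

review_log = (
    [{"category": "how-to", "edited": False}] * 58
    + [{"category": "how-to", "edited": True}] * 2
    + [{"category": "refund", "edited": True}] * 20
    + [{"category": "refund", "edited": False}] * 40
)
print(auto_send_categories(review_log))
```

In this sample, how-to replies are edited about 3% of the time and qualify, while refund replies are edited a third of the time and stay in the human queue.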
If you're trying to figure out which of the twelve is worth building first for your specific business, the honest answer is: whichever one is currently costing you the most hours per week. Map that, then pick. The tooling matters less than picking the right target.
Frequently asked questions
What's the difference between an AI agent and a workflow?
An AI agent uses an LLM to decide what to do next and calls tools in a loop until a goal is met, while a workflow follows fixed, predetermined steps. Use a workflow when the path is predictable and cost-sensitive; use an agent when input is highly variable and the steps aren't known in advance. Most production systems labeled as agents should actually be workflows because they're cheaper to run and easier to debug.
Which AI agent has the highest ROI for small businesses?
A tier-1 customer support triage and reply drafter is typically the highest-ROI agent for any SMB handling more than 50 tickets per week. It classifies inbound tickets, pulls customer context, drafts replies, and sends them to a human queue for approval. Build cost is roughly a weekend, and running cost is a few cents per ticket on cheaper models like Claude Haiku or GPT-4o-mini.
Should I build or buy an AI agent for invoice extraction?
Buy an off-the-shelf solution if you process fewer than 100 invoices per month in standard formats. Build a custom agent if you have custom formats, more than 500 invoices per month, or an unusual ERP system. Critically, extract PDFs as per-page text rather than sending them as images — it's roughly 6x cheaper at similar accuracy.
Why do AI sales outbound agents usually fail?
Fully autonomous outbound agents typically generate spam at scale, which hurts domain reputation and reply rates. The pattern that works is narrower: an agent researches the prospect's recent posts, company blog, and trigger events, then drafts a first email that a human reviews and sends. The agent handles research; the human owns the send decision.
What should I log when running AI agents in production?
Log every tool call with inputs, outputs, latency, and cost, plus every LLM call with model, tokens, and prompt version. Generate a daily report showing cost per agent run, error rate, and human-override rate. Weekly, manually sample around 20 runs to catch silent failures. Also start with human-in-the-loop approval for 2-4 weeks before enabling auto-send, using logged edits to refine prompts.