Build an MCP Server for Your Internal API in One Afternoon

You have an internal REST API. Maybe it's the one that creates customers, pulls invoice status, or kicks off a shipment. Your team uses it through a dashboard, a few scripts, and the occasional Postman tab. Now you want Claude — or whatever assistant your team is using this quarter — to call it directly, safely, without you babysitting a giant system prompt full of curl instructions.
This is what MCP is for. And wrapping a single internal API in an MCP server is a job that fits comfortably into an afternoon if you make the right calls early. The rest of this post is the playbook: what to expose, how to shape the schemas, how to handle errors, and the small decisions that separate a tool the model actually uses from one it silently ignores.
What MCP actually is (and what it isn't)
The Model Context Protocol is a thin, transport-agnostic protocol for letting an LLM client discover and call tools, read resources, and use prompts that live in a separate process. The server you build advertises a typed catalog; the client (Claude Desktop, an IDE, your own agent runner) handles invocation and result rendering.
What this gives you in practice:
- One integration surface instead of N copy-pasted function-calling schemas.
- A clean process boundary: your MCP server holds the API keys, the LLM never sees them.
- Reusability across clients — the same server works in Claude Desktop, in an agent framework, or in a custom orchestrator.
What it does not give you:
- Authentication, rate limiting, or audit logging for free. You build those.
- A magic translation layer. If your API is bad, the tools will be bad. Wrapping ugly endpoints in MCP just gives the model a sharper knife to cut you with.
- Determinism. The model decides when and how to call your tools. Your job is to make the right call obvious and the wrong call hard.
Treat the MCP server as a product surface for a non-human user with strange habits: it reads docs literally, it hallucinates when fields are ambiguous, and it has no memory of last week's incident.
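To make "discover and call" concrete, here is a rough sketch of the two messages a client sends. MCP messages are JSON-RPC 2.0, with tools/list for discovery and tools/call for invocation. The tool name, arguments, and the encode helper below are illustrative assumptions, and in practice the SDK builds and parses these messages for you.

```python
import json

# Request a client sends to discover the server's tool catalog.
list_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# Request a client sends to invoke one tool by name.
# The tool name and arguments are illustrative, not from a real server.
call_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "search_invoices",
        "arguments": {"customer_id": "CUS-00123", "status": "open"},
    },
}


def encode(message: dict) -> str:
    """Serialize a message compactly, roughly as it would travel over a transport."""
    return json.dumps(message, separators=(",", ":"))
```

The point of seeing the raw shape once: the model never touches your REST API, only a tool name and an arguments object. Everything else stays on your side of the process boundary.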
Decide what to expose before you write a line
The single biggest mistake on day one is mapping your REST endpoints 1:1 to MCP tools. A GET /customers/{id}, PUT /customers/{id}, POST /customers, DELETE /customers/{id} quartet becomes four tools, and now the model has to reason about HTTP semantics it shouldn't care about.
Instead, work backwards from the jobs an assistant should be able to do. Sit with the actual users for thirty minutes and write down the sentences they want to say out loud:
- "Find the open invoices for ACME from last quarter."
- "Resend the welcome email to this customer."
- "Create a draft refund for order 88421 but don't approve it."
Each sentence is a candidate tool. Now check:
- Is it safe to expose? Anything destructive without a confirmation step is a no.
- Does it map to one or two API calls? If it's six chained calls with branching, build a server-side handler. Don't make the model orchestrate.
- Is the result something the model can use? A 4 MB JSON dump isn't a result, it's a denial of service against the context window.
A good rule of thumb: aim for 5 to 15 tools on day one. Fewer than 5 and you're probably under-exposing. More than 15 and the model starts confusing them, especially if names overlap.
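The "4 MB JSON dump" check above is the easiest one to enforce in code. A minimal sketch, assuming the upstream API returns a list of invoice dicts: the field list, the cap of 20, and the trim_invoices name are all placeholders to adapt, not anything your API prescribes.

```python
from typing import Any

# Fields worth showing the model; everything else in the upstream payload is dropped.
KEEP_FIELDS = ("id", "amount_cents", "currency", "due_date", "status")
MAX_ITEMS = 20  # hypothetical cap; tune it to your context budget


def trim_invoices(items: list[dict[str, Any]], max_items: int = MAX_ITEMS) -> dict[str, Any]:
    """Return a bounded, documented shape instead of the raw API payload."""
    trimmed = [
        {field: item.get(field) for field in KEEP_FIELDS}
        for item in items[:max_items]
    ]
    return {
        "invoices": trimmed,
        "total_count": len(items),
        # Tell the model explicitly that it is looking at a partial list.
        "truncated": len(items) > max_items,
    }
```

The truncated flag is a design choice worth keeping: without it, the model has no way to know there are more results and will happily summarize a partial list as if it were complete.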
A minimal server in Python
The official mcp Python SDK gets you to a working server in well under 100 lines. Here's a skeleton that wraps an internal billing API with two tools: search invoices, and trigger a payment reminder.

    import os

    import httpx
    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("billing-internal")

    API_BASE = os.environ["BILLING_API_BASE"]
    API_TOKEN = os.environ["BILLING_API_TOKEN"]

    client = httpx.AsyncClient(
        base_url=API_BASE,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
        timeout=10.0,
    )


    @mcp.tool()
    async def search_invoices(
        customer_id: str,
        status: str = "open",
        limit: int = 20,
    ) -> dict:
        """Search invoices for a customer.

        Args:
            customer_id: Internal customer ID, format CUS-XXXXX.
            status: One of "open", "paid", "overdue", "void". Defaults to "open".
            limit: Max results to return, 1-100. Defaults to 20.

        Returns:
            A dict with "invoices" (list) and "total_count" (int).
            Each invoice has: id, amount_cents, currency, due_date, status.
        """
        if status not in {"open", "paid", "overdue", "void"}:
            return {"error": "invalid_status", "allowed": ["open", "paid", "overdue", "void"]}
        limit = max(1, min(limit, 100))
        r = await client.get("/v1/invoices", params={
            "customer_id": customer_id,
            "status": status,
            "limit": limit,
        })
        if r.status_code == 404:
            return {"invoices": [], "total_count": 0}
        r.raise_for_status()
        data = r.json()
        return {
            "invoices": [
                {
                    "id": inv["id"],
                    "amount_cents": inv["amount_cents"],
                    "currency": inv["currency"],
                    "due_date": inv["due_date"],
                    "status": inv["status"],
                }
                for inv in data["items"]
            ],
            "total_count": data["total"],
        }


    @mcp.tool()
    async def send_payment_reminder(invoice_id: str, channel: str = "email") -> dict:
        """Send a payment reminder for an unpaid invoice.

        Only works on invoices with status "open" or "overdue".
        Returns an error if the invoice is already paid or void.
        """
        if channel not in {"email", "sms"}:
            return {"error": "invalid_channel", "allowed": ["email", "sms"]}
        r = await client.post(f"/v1/invoices/{invoice_id}/reminders", json={"channel": channel})
        if r.status_code == 409:
            return {"error": "invoice_not_eligible", "detail": r.json().get("message")}
        if r.status_code == 404:
            return {"error": "invoice_not_found", "invoice_id": invoice_id}
        r.raise_for_status()
        return {"sent": True, "reminder_id": r.json()["id"]}


    if __name__ == "__main__":
        mcp.run()

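Register it in Claude Desktop's config file (claude_desktop_config.json) and you're done with plumbing. The entry below assumes the server code is importable as a module named billing_mcp: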

    {
      "mcpServers": {
        "billing": {
          "command": "python",
          "args": ["-m", "billing_mcp"],
          "env": {
            "BILLING_API_BASE": "https://internal-billing.example.com",
            "BILLING_API_TOKEN": "..."
          }
        }
      }
    }

That's the boilerplate. The real engineering is in the next three sections.
Schema design: write docstrings like an API contract
The model only sees what you put in tool names, parameter names, type hints, and the docstring. That's the entire prompt for tool selection. Treat the docstring as a contract, not commentary.
A few rules I apply on every server I ship:
Name tools as verbs in the user's language. search_invoices beats list_invoices_v1. create_refund_draft beats post_refund. The model is biased toward the most literal reading of the name.
Constrain enums in the type system, not just the docs. If your SDK doesn't support typed enums cleanly, validate inside the function and return a structured error listing the allowed values (see the example above). The model usually recovers from these errors gracefully: it retries with a corrected value on the next turn.
Document the shape of the return value. The model doesn't see your response schema unless you describe it. Two sentences in the docstring saves three tool calls of exploration.
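Mark dangerous tools explicitly. Put words like "destructive", "irreversible", or "requires human approval" in the docstring. Models generally respect these hints, but a hint is a nudge, not a safety control; the confirmation gates in the auth section are what actually stop a bad call.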
Here's a comparison of two versions of the same tool description:
| Weak | Strong |
|---|---|
| def get_user(id) | def lookup_customer(customer_id: str) |
| "Get a user." | "Look up a customer by internal ID (format CUS-XXXXX). Returns name, email, plan, and account status. Returns {error: 'not_found'} if no match." |
| Returns raw API response | Returns a trimmed dict with a documented shape |
The weak version forces the model to guess at field names. The strong version lets it chain confidently into the next call.
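For the enum rule, one way to keep the allowed values in a single place is Python's typing.Literal. The sketch below uses only the standard library; InvoiceStatus and check_status are hypothetical names, and whether your SDK version turns a Literal annotation into a schema-level enum is an assumption worth verifying before you rely on it.

```python
from typing import Literal, Optional, get_args

# Single source of truth for the allowed values.
InvoiceStatus = Literal["open", "paid", "overdue", "void"]
ALLOWED_STATUSES: tuple = get_args(InvoiceStatus)


def check_status(status: str) -> Optional[dict]:
    """Return a structured error for an unknown status, or None if it is valid."""
    if status not in ALLOWED_STATUSES:
        return {"error": "invalid_status", "allowed": list(ALLOWED_STATUSES)}
    return None
```

Annotating the parameter as InvoiceStatus and still calling check_status inside the tool gives you both layers: the schema advertises the enum where the SDK supports it, and the runtime check returns a recoverable error where it doesn't.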
Error surfaces that the model can recover from
REST APIs throw 4xx and 5xx codes. MCP tools should almost never throw. An exception bubbles up to the client as a generic tool error, and the model often gives up or hallucinates a workaround.
Instead, return structured errors as normal results:

    return {
        "error": "invoice_not_eligible",
        "reason": "Invoice is already paid.",
        "invoice_id": invoice_id,
        "current_status": "paid",
    }

The model reads this, understands the failure mode, and either tries a different approach or tells the user why it can't proceed. Compare that to a raw 409 stack trace, which produces a sad "I encountered an error" message.
Reserve real exceptions for things the model genuinely cannot fix: the upstream API is down, the credentials are revoked, the network is gone. Those should fail loudly so your monitoring catches them.
A useful pattern:
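
    class ToolError(dict):
        """A structured error that is still a plain dict, so it serializes like any result."""

        def __init__(self, code: str, message: str, **extra):
            super().__init__(error=code, message=message, **extra)

    # Usage, from inside a tool function:
    return ToolError("rate_limited", "Try again in 30 seconds.", retry_after=30)
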
The retry hint matters. Models tend to respect a retry_after field if it's named obviously, though nothing forces them to, so keep rate limiting enforced server-side as well.
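The split between recoverable and unrecoverable failures can live in one small function. This is a sketch under assumptions: the status-to-code mapping, the UpstreamUnavailable exception, and the to_tool_result name are all hypothetical, and your API's status codes may mean different things.

```python
from typing import Any, Optional


class UpstreamUnavailable(RuntimeError):
    """Raised for failures the model cannot fix (hypothetical name)."""


# Status codes the model can plausibly react to, mapped to stable error codes.
RECOVERABLE = {
    404: "not_found",
    409: "not_eligible",
    422: "invalid_arguments",
    429: "rate_limited",
}


def to_tool_result(
    status_code: int,
    body: dict[str, Any],
    retry_after: Optional[int] = None,
) -> dict[str, Any]:
    """Turn an upstream HTTP outcome into a tool result or a loud failure."""
    if 200 <= status_code < 300:
        return body
    if status_code in RECOVERABLE:
        result: dict[str, Any] = {
            "error": RECOVERABLE[status_code],
            "message": body.get("message", ""),
        }
        if retry_after is not None:
            result["retry_after"] = retry_after
        return result
    # 401/403/5xx and anything unexpected: raise so monitoring sees it.
    raise UpstreamUnavailable(f"upstream returned {status_code}")
```

Keeping the mapping in a dict makes the policy reviewable at a glance: anything not listed fails loudly by default, which is the safer direction to be wrong in.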
Auth, secrets, and the blast radius
Your MCP server is a privileged process. It holds an API token that can probably do a lot of damage. Three things to lock down before you let anyone connect:
Scope the token. If your API supports scoped tokens, mint one that can only hit the endpoints your tools actually call. If it doesn't, this is a great moment to add scoping. Either way, the token in the MCP server should not be the same token your admin dashboard uses.
Add a confirmation gate for destructive actions. The cleanest pattern is a two-tool dance: prepare_refund returns a draft ID and a summary, confirm_refund takes the draft ID and executes. The model can't accidentally skip the preparation step, because confirm_refund needs a draft ID that only exists once prepare_refund has returned. That alone doesn't put a human in the loop, so for high-stakes workflows, route confirm_* calls through a human approval queue (Slack message, email link) before the API call fires.
Log every tool call server-side. Tool name, arguments, response status, latency, and a request ID that traces back to your API logs. When the model does something weird at 2am, you'll want this. A few lines:

    import logging
    import time
    import uuid

    logger = logging.getLogger("mcp.billing")


    async def _call(tool_name, fn, **kwargs):
        rid = uuid.uuid4().hex[:8]
        t0 = time.monotonic()
        try:
            result = await fn(**kwargs)
            logger.info("tool=%s rid=%s ms=%d ok", tool_name, rid, int((time.monotonic()-t0)*1000))
            return result
        except Exception as e:
            logger.exception("tool=%s rid=%s ms=%d err=%s", tool_name, rid, int((time.monotonic()-t0)*1000), e)
            raise

Wrap your tools with this. It's two hours of work to set up properly and it's the difference between a debuggable system and a haunted one.
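The prepare/confirm gate described earlier in this section can be sketched in a few lines. Assumptions are loud here: the in-memory _drafts dict stands in for a real store with persistence and expiry, the actual refund API call is omitted, and the error codes are made up for illustration.

```python
import uuid
from typing import Any

# In-memory draft store; a real server would persist drafts and expire them.
_drafts: dict[str, dict[str, Any]] = {}


def prepare_refund(order_id: str, amount_cents: int) -> dict[str, Any]:
    """Record a refund draft and return a summary. Nothing is executed here."""
    if amount_cents <= 0:
        return {"error": "invalid_amount", "message": "amount_cents must be positive."}
    draft_id = uuid.uuid4().hex
    _drafts[draft_id] = {"order_id": order_id, "amount_cents": amount_cents}
    return {
        "draft_id": draft_id,
        "order_id": order_id,
        "amount_cents": amount_cents,
        "next_step": "Call confirm_refund with this draft_id after approval.",
    }


def confirm_refund(draft_id: str) -> dict[str, Any]:
    """Execute a previously prepared draft at most once."""
    draft = _drafts.pop(draft_id, None)  # pop makes each draft single-use
    if draft is None:
        return {"error": "unknown_draft", "message": "Call prepare_refund first."}
    # The real refund API call would go here (omitted in this sketch).
    return {"refunded": True, **draft}
```

Because the draft ID is random and single-use, a guessed or replayed ID gets a structured error instead of a second refund. A human approval step slots in naturally between the two calls.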
Test the way the model will actually use it
Unit tests are necessary but not sufficient. The interesting failure modes only show up when an LLM is in the loop. A short ritual that's caught real bugs in every server I've shipped:
- Write 10-15 representative prompts that should each trigger one or two tool calls. "Show me open invoices for ACME." "Send a reminder for invoice INV-99231 by SMS." "What's the total amount overdue for customer CUS-00123?"
- Run them through the actual client (Claude Desktop, or an agent loop using the same model you run in production).
- Inspect the tool call traces. Did the model pick the right tool? Did it pass the right arguments? Did it get confused between two similarly-named tools?
- For each failure, fix the schema, not the prompt. If the model used the wrong tool, the tool name or description was unclear. If it passed a malformed argument, the parameter description was ambiguous.
A handful of these iterations gets you from "kind of works" to "reliable enough to leave running." Save the prompts as a regression suite. Re-run them whenever you add or rename a tool.
One last test worth running: deliberately misuse the tools in a prompt and see what happens. "Delete all invoices." "Refund every customer." If the server has the right confirmation gates and scoping, these should produce safe refusals or contained drafts, not incidents.
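Saving the prompts as a regression suite can be as simple as pairing each prompt with the tool call you expect and checking recorded traces against it. A minimal sketch: the trace format (a list of dicts with "tool" and "arguments" keys) is an assumption, so adapt it to whatever your client or agent loop actually logs.

```python
from dataclasses import dataclass


@dataclass
class Case:
    prompt: str
    expected_tool: str
    expected_args: dict  # subset of arguments that must match


def check_case(case: Case, trace: list[dict]) -> list[str]:
    """Compare one recorded trace against expectations; return any problems found."""
    calls = [c for c in trace if c["tool"] == case.expected_tool]
    if not calls:
        used = [c["tool"] for c in trace]
        return [f"expected {case.expected_tool}, model called {used}"]
    args = calls[0]["arguments"]
    return [
        f"{key}: expected {want!r}, got {args.get(key)!r}"
        for key, want in case.expected_args.items()
        if args.get(key) != want
    ]
```

An empty list means the case passed. Anything else is a pointer to which tool name or parameter description needs another pass.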
How BizFlowAI approaches this
Most of our client engagements start exactly here: take one internal system — billing, CRM, an order management tool, a custom ticketing app — and wrap it in an MCP server that the team's existing assistant can call. We work from your API docs and your top five "I wish I could just ask the bot to do this" sentences, ship a working server in days rather than weeks, and put it behind real auth, logging, and approval gates.
If you have an internal API and a rough idea of the jobs you'd like to automate, bring both to a discovery call. The fastest way to find out whether MCP is the right shape for your problem is to sketch the tool list together — most of the design decisions in this post happen in the first hour of that conversation.
What to build next
Once one server is running and your team trusts it, the natural next moves are:
- Add a second server for an adjacent system (CRM if you started with billing, or vice versa). The model can use both in the same conversation, which is where workflows start getting interesting.
- Promote a few read-only tools to resources. MCP resources are good for data the model should always have access to without spending a tool call — a list of customer segments, a pricing table, a current on-call schedule.
- Wire approvals into Slack or email. The prepare/confirm pattern is even more useful when the confirmation comes from a human who didn't initiate the conversation.
The afternoon you spend on the first server is the expensive one. Everything after that compounds.