Claude Hallucinated My Invoice Totals — Until I Switched to


Claude quietly faked the total on a client invoice last month, and I almost shipped it. The PDF looked clean, the table was readable, and Claude returned a confident wrong number. If you're feeding Claude invoices, statements, or any document with rows and columns, you're probably hitting this silent failure right now without knowing it.

The failure that almost shipped

The setup was boring. A small accounting firm came to us drowning in supplier invoices — roughly 200 PDFs a week, each one with a line-item table, each one needing to land in their books as structured JSON. The obvious move, the one every tutorial demos, is to drop the PDF straight into Claude and ask for the line items.

So I did that.

On a clean-looking invoice with seven rows, Claude returned six. It merged two rows that had similar descriptions ("Cable assembly 2m" and "Cable assembly 3m" both with the same unit price). The unit prices it returned were right. The total it returned was wrong by €42. The response sounded completely confident — no warning, no hedge, no "confidence": "low" field anywhere.

That's the dangerous part. A wrong total that looks right gets booked, gets paid, gets reconciled, and then three months later somebody finds it during an audit. By that point the supplier has been paid the wrong amount fourteen times.

Why PDFs break LLM extraction (it's not Claude's fault)

This isn't a Claude bug. It's how PDFs work.

A PDF doesn't actually contain a table. It contains text positioned at x-y coordinates on a page. When you open a PDF and see a neat grid, you're looking at characters that happen to be drawn at coordinates that line up visually. There is no <table>, no <tr>, no row separator in the file. The "table" only exists in your brain.

When Claude reads a PDF, it sees a stream of characters and has to guess where rows start and end based on spacing. On a one-page receipt with generous margins, it guesses right. On a real invoice, it guesses wrong:

  • Tight row spacing — two adjacent rows get fused into one
  • Merged cells for VAT breakdowns — multi-line descriptions get attached to the wrong row
  • Totals flush against the last line item — the total gets treated as another line
  • Multi-page invoices — the header row on page 2 gets read as data

And it guesses wrong silently. No exception, no flag. Just a confident JSON response with the wrong numbers in it.
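To make the failure mode concrete, here is a toy sketch of row grouping by vertical position. It is not a description of how Claude reads PDFs internally; the fragment coordinates, the `group_rows` helper, and the 3-point tolerance are all assumptions chosen for illustration. It only shows that any reader inferring rows from spacing will fuse two rows once the gap between them shrinks below its tolerance, and that it does so without raising anything.

```python
from typing import List, Tuple

# (x, y, text): a piece of text drawn at a position on the page
Fragment = Tuple[float, float, str]


def group_rows(fragments: List[Fragment], y_tolerance: float = 3.0) -> List[str]:
    """Group fragments into rows by vertical proximity, then read left to right."""
    rows: List[List[Fragment]] = []
    last_y = None
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if last_y is not None and abs(frag[1] - last_y) <= y_tolerance:
            rows[-1].append(frag)   # close enough: treated as the same row
        else:
            rows.append([frag])     # far enough: a new row starts
        last_y = frag[1]
    # Sorting the (x, y, text) tuples orders each row by x position
    return [" ".join(text for _, _, text in sorted(row)) for row in rows]


def make_rows(gap: float) -> List[Fragment]:
    """Two invoice rows whose baselines are `gap` points apart."""
    return [
        (50.0, 100.0, "Cable assembly 2m"), (300.0, 100.0, "4"),
        (50.0, 100.0 + gap, "Cable assembly 3m"), (300.0, 100.0 + gap, "2"),
    ]


loose = group_rows(make_rows(12.0))  # generous spacing: two rows
tight = group_rows(make_rows(2.5))   # tight spacing: rows fuse into one
print(loose)
print(tight)
```

With the tight spacing, the two descriptions and the two quantities end up interleaved in a single row, which is the same shape of error as the merged cable rows described earlier.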

Where layout-blind extraction breaks down

  • Multi-line descriptions inside a single row
  • Right-aligned numeric columns that drift left when totals appear
  • VAT summary blocks that look like line items
  • Page footers with totals that get included as a row
  • Scanned/rotated pages where text order is non-linear

The benchmark: 200 real invoices

I ran the test on the actual client workload. Two hundred real supplier invoices, same Claude model, same prompt, same JSON schema. The only variable was input format.

| Input format | Fully correct | At least one error | Monthly cost (200 invoices/week) |
|---|---|---|---|
| Raw PDF upload | 71% | 29% | ~$40 |
| Markdown via pymupdf4llm | 97% | 3% | ~$6 |

The 29% error rate on raw PDFs broke down roughly as:

  • ~14% had at least one row merged with an adjacent row
  • ~9% dropped a column (usually quantity or VAT rate)
  • ~6% had the total miscalculated or pulled from the wrong field

Not acceptable for accounting. Not even close.

The remaining 3% after the fix were genuinely broken source documents — scanned crooked, partially handwritten, or photographed at an angle. Those need OCR, which is a different problem.
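The scoring harness is not shown here, so the following is only one plausible way to define "fully correct": an invoice counts only if every line item and the grand total match a hand-checked reference. The `fully_correct` and `accuracy` helpers, the field names, and the one-cent tolerance are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def fully_correct(expected: Dict, actual: Dict, tol: float = 0.01) -> bool:
    """True only if every line and the grand total match the reference."""
    if len(expected["lines"]) != len(actual["lines"]):
        return False  # a merged or dropped row fails the whole invoice
    for exp, act in zip(expected["lines"], actual["lines"]):
        if exp["description"] != act.get("description"):
            return False
        for field in ("qty", "unit_price", "total"):
            if field not in act or abs(exp[field] - act[field]) > tol:
                return False  # a dropped column or a wrong number
    return abs(expected["grand_total"] - actual["grand_total"]) <= tol


def accuracy(pairs: List[Tuple[Dict, Dict]]) -> float:
    """Share of (expected, actual) pairs that are fully correct."""
    return sum(fully_correct(e, a) for e, a in pairs) / len(pairs)


reference = {
    "lines": [
        {"description": "Cable assembly 2m", "qty": 4, "unit_price": 12.5, "total": 60.0},
        {"description": "Cable assembly 3m", "qty": 2, "unit_price": 18.0, "total": 43.2},
    ],
    "grand_total": 103.2,
}
# Hypothetical bad extraction: the two cable rows fused into one
merged = {
    "lines": [
        {"description": "Cable assembly 2m", "qty": 6, "unit_price": 12.5, "total": 103.2},
    ],
    "grand_total": 103.2,
}

score = accuracy([(reference, reference), (reference, merged)])
print(score)
```

An all-or-nothing score like this is strict on purpose: in accounting, an invoice with one wrong row is a wrong invoice.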

The eight-line fix

Here's the fix, and it's almost embarrassingly small. Instead of sending Claude the PDF, convert the PDF to Markdown first, locally, using pymupdf4llm. Markdown preserves table semantics: pipes for columns, dashes under the header, one row per line. Claude handles Markdown tables well, plausibly because they are everywhere in GitHub READMEs and documentation sites. The structure is unambiguous.

pip install pymupdf4llm anthropic

import pymupdf4llm
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def extract_invoice(pdf_path: str) -> str:
    # Convert the PDF to Markdown locally so tables arrive as pipe tables
    md = pymupdf4llm.to_markdown(pdf_path)
    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{
            "role": "user",
            "content": f"Extract line items as JSON.\n\n{md}"
        }],
    )
    # Raw model text; parse it with json.loads before using it
    return resp.content[0].text

That's it. No OCR, no layout model, no paid extraction API. The conversion runs in about 200ms per invoice on a normal laptop. The Markdown that comes out looks like this:

| Description       | Qty | Unit price | VAT | Total  |
|-------------------|-----|------------|-----|--------|
| Cable assembly 2m | 4   | 12.50      | 20% | 60.00  |
| Cable assembly 3m | 2   | 18.00      | 20% | 43.20  |
| Mounting bracket  | 10  | 3.20       | 20% | 38.40  |

**Subtotal: 141.60**
**Total: 141.60**

Claude has zero ambiguity about where one row ends and the next begins. The pipes are explicit. The model isn't doing layout analysis anymore — it's just reading text it understands perfectly.
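How explicit that structure is shows in how little code it takes to parse it deterministically. The sketch below is a minimal pipe-table parser, assuming the text contains a single well-formed table with no empty edge cells; `split_row` and `parse_markdown_table` are hypothetical helper names, not part of pymupdf4llm.

```python
from decimal import Decimal
from typing import Dict, List


def split_row(line: str) -> List[str]:
    """Split one pipe-table line into trimmed cell strings."""
    return [cell.strip() for cell in line.strip().strip("|").split("|")]


def parse_markdown_table(md: str) -> List[Dict[str, str]]:
    """Parse a single pipe table into one dict per data row."""
    table_lines = [ln for ln in md.splitlines() if ln.strip().startswith("|")]
    if len(table_lines) < 3:
        return []  # need a header, a separator, and at least one data row
    headers = split_row(table_lines[0])
    # table_lines[1] is the |---|---| separator, so data starts at index 2
    return [dict(zip(headers, split_row(ln))) for ln in table_lines[2:]]


sample = """
| Description       | Qty | Unit price | VAT | Total  |
|-------------------|-----|------------|-----|--------|
| Cable assembly 2m | 4   | 12.50      | 20% | 60.00  |
| Cable assembly 3m | 2   | 18.00      | 20% | 43.20  |
| Mounting bracket  | 10  | 3.20       | 20% | 38.40  |
"""

rows = parse_markdown_table(sample)
line_sum = sum(Decimal(r["Total"]) for r in rows)
print(len(rows), line_sum)
```

Three rows come back, and their totals add up to the 141.60 printed under the table. Row boundaries are never in question, because each row is exactly one line.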

Why this also drops token cost 6.4×

Here's the part nobody mentions in the "feed Claude your PDFs" tutorials. A PDF uploaded to Claude is converted to page images, with each page's extracted text sent alongside. Each page burns vision tokens on top of that text, and they add up: a single A4 page can cost 1,500-2,000 tokens depending on density.

Markdown is plain text. Cheap text tokens. A typical invoice that costs ~1,800 vision tokens as a PDF comes out to ~280 text tokens as Markdown.

On the client's actual volume:

| Metric | Raw PDF | Markdown |
|---|---|---|
| Avg input tokens / invoice | ~1,800 (vision) | ~280 (text) |
| Invoices / week | 200 | 200 |
| Monthly cost (Sonnet 4.5) | ~$40 | ~$6 |
| Avg latency / invoice | 4-7s | 1-2s |
| Accuracy | 71% | 97% |

Cheaper, faster, more accurate. There is no tradeoff here. This is one of the rare engineering decisions where every axis moves in the right direction at the same time.
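The 6.4× figure follows directly from the per-invoice token averages. A quick back-of-the-envelope check, using only the approximate numbers quoted above (dollar amounts also depend on output tokens and current model pricing, so this sketch sticks to input tokens):

```python
# Approximate per-invoice input token counts quoted in the article
VISION_TOKENS_PER_INVOICE = 1_800
TEXT_TOKENS_PER_INVOICE = 280
INVOICES_PER_WEEK = 200

ratio = VISION_TOKENS_PER_INVOICE / TEXT_TOKENS_PER_INVOICE
weekly_pdf_tokens = VISION_TOKENS_PER_INVOICE * INVOICES_PER_WEEK
weekly_md_tokens = TEXT_TOKENS_PER_INVOICE * INVOICES_PER_WEEK

print(f"reduction: {ratio:.1f}x")
print(f"weekly input tokens: {weekly_pdf_tokens:,} (PDF) vs {weekly_md_tokens:,} (Markdown)")
```

That is 360,000 input tokens a week for raw PDFs against 56,000 for Markdown.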

Why this actually works (and where it breaks)

The reason this works is conceptually simple. You're not asking the model to do layout analysis and extraction at the same time. You're doing the layout work locally with a deterministic tool (pymupdf4llm builds on PyMuPDF's table detection), and you're only asking Claude to do what it's genuinely good at: understanding structured text and returning JSON.

Two jobs, two tools, each doing what it's best at. This is the same pattern as "don't ask the model to do math, give it a calculator" or "don't ask the model to remember, give it a database."

When pymupdf4llm is NOT enough

  • Scanned PDFs (image-only) — there's no text layer to extract. You need OCR first (Tesseract, or Claude's vision on the page image).
  • Heavily designed marketing PDFs — multi-column flowing layouts confuse table detection. Less common in B2B documents.
  • Forms with checkboxes/signatures — Markdown can't represent these. You need a layout-aware model.
  • Handwritten content — same as scanned, OCR territory.

For 95%+ of real B2B document workflows — supplier invoices, bank statements, lab reports, shipping manifests, purchase orders, utility bills — the source PDFs have a real text layer and pymupdf4llm handles them cleanly.

A production-ready version

The eight-line version is enough to prove the point. The version I actually run in production has a few more guardrails — worth showing because they're cheap to add and they catch the 3% edge cases instead of letting them through silently.

import json
import pymupdf4llm
from anthropic import Anthropic

client = Anthropic()

SYSTEM = """Extract invoice line items as JSON with this schema:
{
  "supplier": str, "invoice_number": str, "date": "YYYY-MM-DD",
  "lines": [{"description": str, "qty": float,
             "unit_price": float, "vat_rate": float, "total": float}],
  "subtotal": float, "vat_total": float, "grand_total": float,
  "confidence": "high" | "medium" | "low",
  "warnings": [str]
}
If any value is ambiguous or missing, set confidence to "low" and
list the issue in warnings. Never invent values."""

def extract(pdf_path: str) -> dict:
    md = pymupdf4llm.to_markdown(pdf_path)
    if len(md.strip()) < 50:
        return {"error": "no text layer — needs OCR"}

    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        system=SYSTEM,
        messages=[{"role": "user", "content": md}],
    )
    try:
        data = json.loads(resp.content[0].text)
    except json.JSONDecodeError:
        # The model returned something other than bare JSON; fail loudly
        return {"error": "model response was not valid JSON"}

    # Deterministic sanity check: Claude doesn't do the math, we do
    computed = round(sum(line["total"] for line in data["lines"]), 2)
    if abs(computed - data["grand_total"]) > 0.02:
        data["confidence"] = "low"
        data.setdefault("warnings", []).append(
            f"line sum {computed} != grand_total {data['grand_total']}")
    return data

Two extra things worth noticing:

  1. Empty-text guard — if pymupdf4llm returns almost nothing, the PDF has no text layer. Route it to an OCR pipeline instead of pretending it worked.
  2. Deterministic total check — never trust the model's arithmetic. Sum the lines in Python and compare. If it doesn't match, flag the document for human review.

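That last check is what turns a 97% accurate pipeline into one that fails loudly. Documents whose line items don't add up to the stated total get caught and routed to a human instead of being silently booked. It won't catch errors that leave the arithmetic consistent, such as a misread invoice number, so the confidence and warnings fields still matter.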
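The float comparison above works because of its tolerance, but money is often easier to reason about in exact decimals. Here is a minimal sketch of the same reconciliation using Python's `decimal` module, with a second check that subtotal plus VAT equals the grand total. The `reconcile` and `to_decimal` helpers and the one-cent tolerance are assumptions for illustration, and it assumes line totals include VAT, as in the sample table earlier.

```python
from decimal import Decimal
from typing import Dict, List


def to_decimal(value) -> Decimal:
    """Convert via str so 43.2 becomes Decimal('43.2'), not a binary float expansion."""
    return Decimal(str(value))


def reconcile(data: Dict, tolerance: Decimal = Decimal("0.01")) -> List[str]:
    """Return arithmetic warnings; an empty list means the totals agree."""
    warnings: List[str] = []
    grand_total = to_decimal(data["grand_total"])

    # Check 1: the line totals must add up to the grand total
    line_sum = sum((to_decimal(line["total"]) for line in data["lines"]), Decimal("0"))
    if abs(line_sum - grand_total) > tolerance:
        warnings.append(f"line sum {line_sum} != grand_total {grand_total}")

    # Check 2: the summary block must be internally consistent
    stated = to_decimal(data["subtotal"]) + to_decimal(data["vat_total"])
    if abs(stated - grand_total) > tolerance:
        warnings.append(f"subtotal + vat_total {stated} != grand_total {grand_total}")
    return warnings


invoice = {
    "lines": [{"total": 60.00}, {"total": 43.20}, {"total": 38.40}],
    "subtotal": 118.00, "vat_total": 23.60, "grand_total": 141.60,
}
merged = dict(invoice, lines=invoice["lines"][:2])  # simulate a dropped row

print(reconcile(invoice))
print(reconcile(merged))
```

The clean invoice returns no warnings, while the one with a missing row returns a line-sum warning that can be appended to the `warnings` field before routing the document to review.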

Why bizflowai.io helps with this

Document pipelines like this are exactly what we build for clients at bizflowai.io — invoice ingestion into accounting software, supplier statement reconciliation, bank statement parsing into categorized transactions. The pattern is always the same: do the structural work with deterministic tools locally, then hand clean Markdown or JSON to the model for the part that genuinely needs language understanding. That's how you get production-grade accuracy on document workflows without burning your API budget or trusting a confident-sounding hallucination.

Frequently asked questions

Why does sending PDFs directly to Claude cause extraction errors?

A PDF doesn't contain real tables. It stores text at x-y coordinates on a page, so Claude must guess where rows and columns begin based on spacing. On tight invoices with merged VAT cells or totals flush against line items, it guesses wrong silently. In a test of 200 invoices, raw PDF upload was only 71% fully correct, with merged rows, dropped columns, and miscalculated totals.

How do I extract invoice line items accurately with Claude?

Convert the PDF to Markdown locally before sending it to Claude. Use the Python library pymupdf4llm: import it, call to_markdown on the PDF path, and send the resulting string to Claude instead of the file. Markdown preserves table structure with pipes and dashes, which Claude reads natively. This took accuracy from 71% to 97% on 200 real invoices using the same model and prompt.

What is pymupdf4llm and why use it for document pipelines?

pymupdf4llm is a Python library that converts PDFs to Markdown locally, preserving table semantics with pipes for columns and one row per line. It runs in about 200 milliseconds per invoice on a normal laptop, requires no OCR, no layout model, and no paid API. It handles the deterministic layout work so Claude can focus on understanding structured text and returning JSON.

How much does converting PDFs to Markdown reduce Claude token costs?

Input tokens drop about 6.4 times, from roughly 1,800 to roughly 280 per invoice. Uploaded PDFs are processed as page images, which consume vision tokens, while Markdown is plain text. For a workflow processing 200 invoices per week, this is roughly the difference between a $40 month and a $6 month, on top of the accuracy improvement from 71% to 97%.

When should I preprocess documents instead of sending them directly to an LLM?

Preprocess whenever your workflow involves structured tabular data, including invoices, bank statements, lab reports, or shipping manifests. Asking a model to do layout analysis and extraction simultaneously causes silent errors on tight tables or merged cells. Splitting the job, using a deterministic tool like pymupdf4llm for layout and the LLM only for understanding structured text, produces dramatically higher accuracy and lower cost.



Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
