I Ran 600 Scanned Invoices Through Claude PDF Upload — 29%

A small invoicing business I work with gets 200-400 supplier PDFs a week. Scanned, rotated, stamped, mixed Cyrillic and Latin script. The owner wanted one thing: structured JSON in a Postgres table for reconciliation. The "drag PDF into Claude" approach lost 29% of fields on real input. Here's the two-stage pipeline that pushed accuracy to 97% and cut the token bill 77%.
What "29% wrong" actually looks like
I ran 600 real supplier invoices through Claude's native PDF upload with a strict JSON schema prompt. The output looked fantastic. That's the problem.
On one Cyrillic-header invoice with two stamps and a slight rotation, Claude returned:
{
"supplier": "Дунав Доо",
"invoice_no": "2024-1187",
"vat_rate": 10,
"subtotal": 48250,
"total": 48250,
"line_items": [
{ "desc": "Транспорт + утовар", "qty": 1, "price": 12500 }
]
}
Three things were wrong and none of them were obvious:
- The real VAT rate was 20%, not 10%. Claude filled the gap.
- The
totalfield is actually the subtotal. The real total was 57,900 RSD. - The line item "Транспорт + утовар" is two merged rows — "Транспорт" and "Утовар" were separate entries on the original.
Across the 600-invoice run:
- 71% fully correct
- 22% had one field wrong (usually VAT or total)
- 7% had structural errors (merged rows, missing items)
For reconciliation, anything below 99% is unusable. You can't push hallucinated VAT into an accounting ledger.
Failure modes I saw repeatedly
- VAT rate invented when the stamp covered the percentage
- Subtotal returned as total when the total line was below a horizontal rule
- Two-line item descriptions merged into one row
- Date parsed as US format when the document was DD.MM.YYYY
Why native PDF upload breaks on scans
When Claude receives a born-digital PDF, it has access to the text layer — character positions, font metrics, table cell boundaries. When it receives a scanned PDF, none of that exists. It's looking at a rasterized image at whatever DPI the scanner produced, and the multimodal layer downsamples it further.
The model can read the page, but column alignment, row boundaries, and stamp-overlapping numbers are reconstructed from visual guesswork. Under uncertainty, LLMs produce the most plausible token. For invoices, plausible-but-wrong is the worst failure mode you can have, because it passes review.
Tutorials never hit this because they test on clean English research papers or Stripe receipts. Real supplier invoices are not that.
The two-stage pipeline
Rule one: don't let Claude see the PDF. Convert it to text first, then to structured markdown, then ask Claude to extract fields.
PDF
→ [Stage 1] OCR (Tesseract or Mistral OCR)
→ raw text + positional hints
→ [Stage 2] Haiku: clean into structured markdown
→ headers, tables, key-value blocks
→ [Stage 3] Claude Sonnet: extract JSON per schema
→ Postgres + Telegram alert if confidence < threshold
Stage 1 is OCR. I route born-digital scans (single-language, clean rotation) through Tesseract because it's free and fast. Anything multi-script, rotated, or stamped goes to Mistral OCR, which handles rotation correction and overlapping marks far better.
# stage 1: OCR routing
def ocr_route(pdf_path: str) -> str:
meta = inspect_pdf(pdf_path) # detects scan vs born-digital, script, rotation
if meta.is_born_digital and meta.script == "latin":
return tesseract_extract(pdf_path, lang="eng+srp_latn")
return mistral_ocr(pdf_path, rotation_correct=True, multi_script=True)
Stage 2 is the cheap structuring call. The job here is not to understand the invoice. It's to turn messy OCR text into well-formed markdown. Haiku does this for fractions of a cent per page.
STRUCTURE_PROMPT = """
You are a text formatter. Convert the OCR output below into clean markdown.
- Document metadata as a key-value block at the top
- Line items as a markdown table with columns: description, qty, unit_price, total
- Totals (subtotal, VAT, grand total) as a labeled key-value block at the bottom
Do NOT correct numbers. Do NOT infer missing fields. Only restructure what's there.
OCR TEXT:
{ocr_text}
"""
Stage 3 is the real extraction. Now Claude is reading structured markdown with explicit table boundaries and labeled totals. It stops hallucinating because there's nothing to guess.
EXTRACT_PROMPT = """
Extract the following fields as JSON matching this schema:
{schema}
Rules:
- If a field is not clearly present, set it to null and add it to "ambiguous_fields"
- VAT rate must be an integer percentage from the document, never inferred
- Return overall confidence 0-1 based on field clarity
MARKDOWN INVOICE:
{structured_md}
"""
Real numbers from the 600-invoice rerun
Same 600 invoices, same schema, same Postgres target. Native upload vs two-stage pipeline:
| Metric | Native PDF upload | Two-stage pipeline |
|---|---|---|
| Field accuracy | 71% | 97% |
| Avg tokens per invoice | ~14,000 | ~3,200 |
| Avg processing time | 11.2s | 4.1s |
| Cost per 1,000 invoices | ~$38 | ~$8.70 |
The token drop is the biggest cost lever — you're not shipping image bytes through the API anymore, and the Haiku structuring step is roughly 10× cheaper than Sonnet for the same input length.
The remaining 3% are genuine edge cases: handwritten notes over numbers, severely degraded scans, missing pages. These get a Telegram ping with the PDF and OCR output attached, and a human handles them in 30 seconds. Don't try to automate the long tail.
What I tuned along the way
- Tesseract language pack:
eng+srp_latn+srp_cyrlcut OCR errors on mixed-script headers by ~40% - Confidence threshold for human routing: 0.82 (anything below pings Telegram)
- Schema-level VAT validation: reject any rate that isn't in
{0, 10, 20}for Serbian invoices
The n8n flow
The whole thing runs nightly in n8n on my home server. Five nodes:
[Webhook / Folder trigger]
↓
[Python: OCR routing]
↓
[Python: Haiku structuring call]
↓
[HTTP: Claude Sonnet extraction]
↓
[Postgres insert] ──┬── if confidence < 0.82 → [Telegram alert]
Two Python nodes do the heavy lifting (OCR + structuring), one HTTP node hits the Anthropic API, one Postgres node writes the row, one Telegram node handles edge cases. No Kubernetes, no queue, no Lambda. It processes 300 invoices in about 22 minutes.
When native PDF upload is still the right call
I'm not telling you to never use Claude's PDF upload. If your input is:
- Born-digital (exported from accounting software, not scanned)
- Single page
- English, simple layout
- Stripe, Amazon, Shopify-style receipts
Then native upload works fine and the two-stage pipeline is over-engineering. Don't build infrastructure you don't need.
But the moment a human had to physically touch the document — print it, sign it, stamp it, photograph it — the pipeline isn't optional. It's the difference between a demo that wows on Twitter and a production system that an accountant trusts on Monday morning.
Why bizflowai.io helps with this
This pipeline is one of the document-processing flows we deploy for clients at bizflowai.io — the OCR routing, the cheap structuring pass, the schema-validated extraction, and the human-in-the-loop fallback for the 3% that shouldn't be automated. Most small businesses drowning in supplier PDFs don't need a custom OCR vendor or a six-figure document AI platform; they need exactly this shape of pipeline wired into the tools they already use.
Frequently asked questions
Why does Claude's native PDF upload fail on scanned invoices?
When Claude's native PDF upload processes a scanned page, it treats it as a low-fidelity image, losing text positioning, table structure, and column alignment before reasoning begins. Under uncertainty, the model fills gaps with plausible numbers — hallucinating VAT rates, merging line items, or returning subtotals as totals. Across 600 test invoices, 29% had at least one wrong field, making the output unusable for accounting reconciliation.
How do I extract structured data from scanned PDF invoices reliably?
Use a two-stage pipeline. Stage one: run the PDF through real OCR first — Tesseract for born-digital scans, Mistral OCR for rotated, multi-script, or stamped documents — producing raw text with positional hints. Stage two: use a cheap LLM like Claude Haiku to convert that text into structured markdown (headers, tables, key-value blocks). Only then pass the markdown to Claude for JSON extraction against your schema.
When should I use native PDF upload vs a two-stage OCR pipeline?
Native PDF upload works fine when documents are born-digital, single-page, English, and simply laid out — think Stripe invoices or Amazon receipts. Don't over-engineer those. But for scans, photos, mixed scripts, rotated pages, stamps, or anything a human physically handled, the two-stage OCR-then-structure pipeline is required. It's the difference between a demo and a production system that won't corrupt your accounting data.
What accuracy and cost improvements does the two-stage pipeline deliver?
On the same 600 supplier invoices, field accuracy jumped from 71% to 97% after switching from native PDF upload to the OCR-plus-markdown pipeline. Tokens per invoice dropped from roughly 14,000 to 3,200 because image data is no longer shipped to Claude. Processing time fell from 11 seconds to 4 seconds per invoice, while edge cases below a confidence threshold get routed to a human.
How do I handle invoice extraction edge cases in production?
Don't try to automate everything. Build the pipeline in a tool like n8n with two Python nodes (OCR and structuring), the Claude extraction call, then a write to Postgres. Set a confidence threshold on Claude's output — when a record falls below it, send a Telegram alert and route that invoice to a human reviewer. Roughly 3% of real-world invoices need this manual fallback.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.
Frequently asked questions
Why does Claude's native PDF upload fail on scanned invoices?
When Claude's native PDF upload processes a scanned page, it treats it as a low-fidelity image, losing text positioning, table structure, and column alignment before reasoning begins. Under uncertainty, the model fills gaps with plausible numbers — hallucinating VAT rates, merging line items, or returning subtotals as totals. Across 600 test invoices, 29% had at least one wrong field, making the output unusable for accounting reconciliation.
How do I extract structured data from scanned PDF invoices reliably?
Use a two-stage pipeline. Stage one: run the PDF through real OCR first — Tesseract for born-digital scans, Mistral OCR for rotated, multi-script, or stamped documents — producing raw text with positional hints. Stage two: use a cheap LLM like Claude Haiku to convert that text into structured markdown (headers, tables, key-value blocks). Only then pass the markdown to Claude for JSON extraction against your schema.
When should I use native PDF upload vs a two-stage OCR pipeline?
Native PDF upload works fine when documents are born-digital, single-page, English, and simply laid out — think Stripe invoices or Amazon receipts. Don't over-engineer those. But for scans, photos, mixed scripts, rotated pages, stamps, or anything a human physically handled, the two-stage OCR-then-structure pipeline is required. It's the difference between a demo and a production system that won't corrupt your accounting data.
What accuracy and cost improvements does the two-stage pipeline deliver?
On the same 600 supplier invoices, field accuracy jumped from 71% to 97% after switching from native PDF upload to the OCR-plus-markdown pipeline. Tokens per invoice dropped from roughly 14,000 to 3,200 because image data is no longer shipped to Claude. Processing time fell from 11 seconds to 4 seconds per invoice, while edge cases below a confidence threshold get routed to a human.
How do I handle invoice extraction edge cases in production?
Don't try to automate everything. Build the pipeline in a tool like n8n with two Python nodes (OCR and structuring), the Claude extraction call, then a write to Postgres. Set a confidence threshold on Claude's output — when a record falls below it, send a Telegram alert and route that invoice to a human reviewer. Roughly 3% of real-world invoices need this manual fallback.