Claude Lies About Your PDF Numbers — Force It To Show

You uploaded a scanned supplier invoice, asked Claude for the VAT, and got back a clean table with 1,847 RSD. Confident. No hedging. Wrong by 40 dinars. On a real client project running 2,100 Serbian invoices through Claude, naive PDF upload gave us 71% field accuracy — and zero warnings on the 29% it got wrong. Here's the exact prompt shape and six-line verifier that pushed us to 99.2% with a full audit trail.

The silent failure mode nobody benchmarks

Most "Claude reads PDFs" demos use clean, born-digital documents — a Stripe invoice, a SaaS contract, a US tax form in English. On that diet Claude looks magical. The moment you hand it a real scanned supplier invoice from a small business — mixed Cyrillic and Latin text, merged cells, a faint blue stamp over the total, a printer streak through the date — the failure mode is not "I cannot read this." The failure mode is a confident wrong answer with no error code.

On our client project, here is what we actually measured across 2,100 scanned invoices:

Approach                                               | Field accuracy | Errors flagged automatically
Raw PDF upload, ask for table                          | 71.0%          | 0%
Pre-extract text with pdfplumber, send text to Claude  | 89.1%          | 0%
Send page images + force JSON with bounding boxes      | 98.4%          | partial
Same + crop-verification pass                          | 99.2%          | 100% of remaining errors

The jump from 89% to 99.2% is not the interesting number. The interesting number is the last column. At 89%, about 11% of extracted values are wrong (the equivalent of 231 in every 2,100) and you have no idea which ones. At 99.2%, that drops to about 17 in every 2,100, and you know exactly which 17.
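To make the two columns concrete, here is a minimal sketch of how both could be computed from a hand-labelled sample. The `FieldResult` record, the `score` helper, and the four sample rows are illustrative assumptions, not the project's actual evaluation code or data.

```python
from dataclasses import dataclass

@dataclass
class FieldResult:
    predicted: str   # value the pipeline extracted
    truth: str       # hand-labelled ground truth
    flagged: bool    # True if the pipeline sent this field to human review

def score(results):
    """Return (field_accuracy, share_of_errors_flagged) for a labelled sample."""
    if not results:
        raise ValueError("need at least one labelled field")
    errors = [r for r in results if r.predicted != r.truth]
    accuracy = 1 - len(errors) / len(results)
    if not errors:
        return accuracy, 1.0  # nothing was wrong, so nothing was missed
    flagged = sum(1 for r in errors if r.flagged)
    return accuracy, flagged / len(errors)

# Hypothetical sample: one wrong VAT amount that the pipeline did flag.
sample = [
    FieldResult("1.847,00", "1.887,00", True),
    FieldResult("104159978", "104159978", False),
    FieldResult("2024-0473", "2024-0473", False),
    FieldResult("11.082,00", "11.082,00", False),
]
accuracy, errors_flagged = score(sample)
```

Tracking the second number separately is the point: a pipeline can gain accuracy while still flagging none of its remaining errors.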

Why pre-extracting text makes things worse, not better

The first instinct of every engineer who hits this problem: "Claude is bad at OCR, let's give it clean text." So you reach for pdfplumber, pymupdf, or pdftotext and feed the extracted string to Claude.

Accuracy goes up. You ship it. Three weeks later your client's accountant calls.

Here is what pre-extraction silently does to a Serbian invoice:

  • Merged cells collapse. A two-line description "Servis klima uređaja / Đurđevdan 12" becomes one mangled field with no separator. Claude sees "Servis klima uređajaĐurđevdan12" and parses it as a single line item with a weird price.
  • Diacritics get stripped or substituted. Đ becomes D, š becomes s, č becomes c. Now the company name in the database does not match the company name in the invoice — silently.
  • Glyph confusion in scanned PDFs. A faint 7 becomes 1. A 0 in a stamp region becomes O. The text extraction library does not flag low confidence — it just commits.
  • Layout is gone. Claude no longer knows which number is in the "VAT" column and which is in the "total" column. It guesses based on label proximity in the flat text. On Serbian invoices where the layout is non-standard, it guesses wrong.

The worst part: Claude has no way to know any of this happened. By the time it sees the string, the damage is upstream and invisible. You get an 89% accuracy system that feels like a 99% accuracy system because the errors look plausible.
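The diacritic substitution described above is easy to reproduce and easy to detect once you have a trusted copy of the name to compare against. The sketch below is an illustration only: `fold_diacritics` and `lost_diacritics` are hypothetical helper names, and the explicit mapping for Đ/đ is there because those two letters have no Unicode decomposition.

```python
import unicodedata

# Đ (U+0110) and đ (U+0111) do not decompose, so they need an explicit mapping.
_NO_DECOMPOSITION = str.maketrans({"\u0110": "D", "\u0111": "d"})

def fold_diacritics(text):
    """Mimic the lossy substitution: Đ -> D, š -> s, č -> c, and so on."""
    decomposed = unicodedata.normalize("NFKD", text.translate(_NO_DECOMPOSITION))
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def lost_diacritics(original, extracted):
    """True when the two strings differ only by stripped or replaced diacritics."""
    return (original != extracted
            and fold_diacritics(original) == fold_diacritics(extracted))

street_in_db = "\u0110ur\u0111evdan 12"   # "Đurđevdan 12"
street_extracted = "Durdevdan 12"          # what a lossy extractor might emit
mismatch = lost_diacritics(street_in_db, street_extracted)
```

A plain equality check between the database value and the extracted value fails here, which is exactly the silent mismatch the accountant ends up finding.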

Send images, demand coordinates

The fix is to flip the pipeline. Stop pre-processing. Send each page as a PNG and force Claude to commit to where on the page it saw each value.

import base64
from io import BytesIO

import anthropic
from pdf2image import convert_from_path

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
pages = convert_from_path("invoice.pdf", dpi=200)  # one PIL image per page

def encode(img):
    """Return the image as a base64-encoded PNG string."""
    buf = BytesIO()
    img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode()

PROMPT = """Extract every field from this invoice as JSON.

Schema for each field:
{
  "name": "<field name, e.g. 'vat_amount', 'supplier_pib', 'line_item_3_price'>",
  "value": "<exact string as it appears>",
  "bbox": [x, y, width, height],   // integer pixel coordinates in this image
  "page": <page number>
}

Rules:
- Return every field you can see. Do not summarize.
- Do not skip fields you are unsure about — include them and we will verify.
- bbox must tightly enclose the value text, not the label.
- If a value spans multiple cells, return one field per cell.
Return a JSON array. No prose."""

results = []
for i, page in enumerate(pages, start=1):
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{
            "role": "user",
            "content": [
                {"type": "image", "source": {
                    "type": "base64", "media_type": "image/png",
                    "data": encode(page)}},
                {"type": "text", "text": PROMPT + f"\nThis is page {i}."}
            ]
        }]
    )
    results.append(msg.content[0].text)

The bounding box requirement is the trick. Without it, Claude can hallucinate 1,847 because it sounds right. With it, Claude has to commit to a rectangle on the image. If it guesses the number, it will guess the box too — and now the lie is visible, because the box will be in the wrong place or the box content will not match the claimed value.
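One caveat worth checking before you trust the rectangles: Anthropic's vision documentation describes large images being downscaled before the model sees them, in which case the returned pixel coordinates would refer to the smaller image, not your 200 DPI render. The sketch below assumes a long-edge limit of 1568 pixels (treat that number as an assumption and check the current docs) and resizes the page yourself so the coordinate space is known. `fit_within` and `to_original` are hypothetical helper names.

```python
def fit_within(width, height, max_edge=1568):
    """Return (new_width, new_height, scale) so the long edge is at most max_edge.

    max_edge=1568 is an assumed limit, not a value taken from the article.
    """
    long_edge = max(width, height)
    if long_edge <= max_edge:
        return width, height, 1.0
    scale = max_edge / long_edge
    return round(width * scale), round(height * scale), scale

def to_original(bbox, scale):
    """Map an [x, y, w, h] box from the resized image back to original pixels."""
    return [round(v / scale) for v in bbox]

# An A4 page rendered at 200 DPI is roughly 1654 x 2339 pixels.
new_w, new_h, scale = fit_within(1654, 2339)
```

With Pillow you would call `page.resize((new_w, new_h))` before `encode(page)`, then pass each returned box through `to_original(bbox, scale)` if you want to crop from the full-resolution render.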

What the JSON looks like in practice

[
  {"name": "supplier_pib", "value": "104159978", "bbox": [412, 188, 96, 18], "page": 1},
  {"name": "invoice_number", "value": "2024-0473", "bbox": [612, 240, 88, 20], "page": 1},
  {"name": "vat_amount", "value": "1.847,00", "bbox": [598, 712, 74, 19], "page": 1},
  {"name": "total_amount", "value": "11.082,00", "bbox": [598, 738, 84, 21], "page": 1}
]

That's it. Every number has a home address on the page. Now we can audit it.
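Before auditing, it helps to reject boxes that cannot be real. This is a minimal sketch, assuming the reply contains one JSON array in the schema above; `parse_fields` is a hypothetical helper, and the page size passed in is whatever image you actually sent.

```python
import json

def parse_fields(raw, width, height):
    """Split the model's reply into (valid, rejected) fields for a width x height page."""
    # Tolerate stray prose or a code fence around the array.
    start, end = raw.find("["), raw.rfind("]")
    if start == -1 or end < start:
        raise ValueError("no JSON array found in reply")
    valid, rejected = [], []
    for field in json.loads(raw[start:end + 1]):
        bbox = field.get("bbox") if isinstance(field, dict) else None
        ok = (
            isinstance(bbox, list) and len(bbox) == 4
            and all(isinstance(v, int) for v in bbox)
            and bbox[0] >= 0 and bbox[1] >= 0 and bbox[2] > 0 and bbox[3] > 0
            and bbox[0] + bbox[2] <= width and bbox[1] + bbox[3] <= height
        )
        (valid if ok else rejected).append(field)
    return valid, rejected

reply = """[
  {"name": "vat_amount", "value": "1.847,00", "bbox": [598, 712, 74, 19], "page": 1},
  {"name": "total_amount", "value": "11.082,00", "bbox": [598, 738, 84, 2100], "page": 1}
]"""
valid, rejected = parse_fields(reply, width=1654, height=2339)
```

The second field is rejected because its box runs off the bottom of the page. A box like that is a cheap early signal that the value was guessed, and it can go to review without spending a verifier call.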

The six-line verifier that catches the last 0.8%

Coordinates by themselves get you to ~98.4%. The verifier closes the loop. For each field, crop the bounding box, send only the crop back to Claude, ask "what number do you see here," and compare to the original value. Match → confirmed. Mismatch → human review.

def verify(page_img, field):
    """Re-read one field from a padded crop; return (matches, text_seen)."""
    x, y, w, h = field["bbox"]
    crop = page_img.crop((x - 4, y - 4, x + w + 4, y + h + 4))  # 4 px pad on each side
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=64,
        messages=[{"role": "user", "content": [
            {"type": "image", "source": {"type": "base64",
                "media_type": "image/png", "data": encode(crop)}},
            {"type": "text", "text": "What text appears in this image? Return only the text, nothing else."}
        ]}]
    )
    seen = msg.content[0].text.strip()
    return seen == field["value"], seen

Why does this work? The first pass asks Claude to read a full A4 page and produce 30+ fields. Cognitive load is high, attention is spread. The second pass shows it a 70×20 pixel crop with one number on it. That is a much easier task, and crucially, it is a different task — the model is not biased by the surrounding context. If both passes agree, you have two independent reads. If they disagree, you have a flag.
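The agree-or-flag logic can be sketched separately from the API call, which also makes it testable offline. This is an illustration under stated assumptions: `triage` and `normalise` are hypothetical names, and `read_crop` stands in for whatever performs the second read. The comparison only normalises Unicode form and whitespace, so look-alike characters from different scripts still count as a mismatch.

```python
import unicodedata

def normalise(text):
    """Apply NFC and collapse whitespace; look-alike glyphs stay distinct."""
    return " ".join(unicodedata.normalize("NFC", text).split())

def triage(fields, read_crop):
    """Split fields into (confirmed, flagged) using a second read of each crop.

    read_crop is any callable that takes a field and returns the text seen
    in that field's crop.
    """
    confirmed, flagged = [], []
    for field in fields:
        seen = read_crop(field)
        if normalise(seen) == normalise(field["value"]):
            confirmed.append(field)
        else:
            flagged.append({**field, "seen": seen})  # keep both reads for the reviewer
    return confirmed, flagged

# Stubbed second reads, keyed by field name, standing in for the model.
second_reads = {"vat_amount": " 1.847,00\n", "line_item_3_price": "8.900,00"}
fields = [
    {"name": "vat_amount", "value": "1.847,00"},
    {"name": "line_item_3_price", "value": "12.400,00"},
]
confirmed, flagged = triage(fields, lambda f: second_reads[f["name"]])
```

In the real pipeline, `read_crop` could be something like `lambda f: verify(pages[f["page"] - 1], f)[1]`, reusing the `verify` function above.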

The 0.8% that still slips through? Both passes agree on the wrong answer — which happens almost exclusively on stamped-over digits where the underlying glyph is genuinely unreadable. Those go straight to a human review queue in Telegram with the crop attached. The accountant sees the 70×20 pixel image, types the correct value, done in 4 seconds.

What the verifier actually catches

  • 7 read as 1 on a faint thermal print — first pass missed it, crop pass saw the serif
  • A line item where the bbox shifted one row down — value 12.400,00 did not match crop content 8.900,00
  • Cyrillic З (Ze) confused with Latin 3 in a customer code — flagged because the crop pass returned З while the first pass returned 3
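The Cyrillic-versus-Latin case can also be screened for without a model call. A minimal sketch, assuming that a Cyrillic letter sitting next to digits or Latin letters in a code field deserves a second look; `scripts_in` and `looks_mixed` are hypothetical helpers, and the rule will need tuning for fields where such mixing is legitimate.

```python
import unicodedata

def scripts_in(value):
    """Return the first word of each character's Unicode name, e.g. CYRILLIC, LATIN, DIGIT."""
    labels = set()
    for ch in value:
        if ch.isspace():
            continue
        labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels

def looks_mixed(value):
    """True when Cyrillic letters appear alongside Latin letters or digits."""
    labels = scripts_in(value)
    return "CYRILLIC" in labels and ("LATIN" in labels or "DIGIT" in labels)

code_with_ze = "\u041747"   # Cyrillic capital Ze followed by the digits 4 and 7
code_with_three = "347"     # all digits
```

Here `looks_mixed(code_with_ze)` is true and `looks_mixed(code_with_three)` is false, even though the two strings are nearly indistinguishable on a faint scan.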

Cost, latency, and when this is overkill

Real numbers from the production pipeline, per invoice (average 2.3 pages):

  • First pass (full pages + JSON extraction): ~$0.018 per invoice, 6-9 seconds
  • Verifier pass (avg 24 field crops per invoice): ~$0.011 per invoice, 4-7 seconds in parallel
  • Total: ~$0.029 per invoice, 10-15 seconds end-to-end

For a small business processing 200 supplier invoices a month, that is $5.80 in API cost to avoid manual data entry on 99.2% of fields and catch every error on the remaining 0.8%. A single tax penalty for a wrong VAT submission is more than a year of this pipeline.
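The arithmetic behind those figures, as a small sketch you can rerun with your own volumes. The two per-invoice costs are the ones quoted above; `monthly_cost` is a hypothetical helper.

```python
from decimal import Decimal

FIRST_PASS = Decimal("0.018")  # USD per invoice, full pages + JSON extraction
VERIFIER = Decimal("0.011")    # USD per invoice, crop-verification pass

def monthly_cost(invoices_per_month):
    """API cost in USD for one month at the per-invoice figures above."""
    return (FIRST_PASS + VERIFIER) * invoices_per_month

per_invoice = FIRST_PASS + VERIFIER
monthly = monthly_cost(200)
yearly = monthly * 12
```

Using `Decimal` keeps the sums exact, which is a sensible habit for anything that sits near accounting data.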

When is this overkill? If your documents are born-digital PDFs in English with consistent layouts (Stripe receipts, AWS bills, Shopify exports), pdfplumber + Claude on clean text will hit 99%+ on its own. The coordinates-and-verifier approach earns its keep specifically on:

  • Scanned documents (any DPI under 300)
  • Non-Latin scripts or mixed scripts
  • Variable layouts (different supplier templates)
  • Anything that goes into a regulated system — accounting, tax, medical, legal

Why bizflowai.io helps with this

This exact pipeline — image-mode extraction with bounding boxes, crop-verification pass, Telegram review queue for the flagged minority — is what we deploy at bizflowai.io for clients drowning in supplier invoices, delivery notes, and scanned contracts. The work is mostly stitching the verifier to the client's existing accounting system and tuning the field schema to their document mix; the core technique above is the engine.

Frequently asked questions

Why does pre-extracting text before sending to Claude make invoice extraction less reliable?

Pre-extracting text hides the damage from Claude. When you run OCR or a text library first, merged cells silently collapse into one field, Serbian diacritics get stripped or replaced, and digits like 7 can become 1. Claude never sees the original document, so it cannot detect what was lost. In one project, this approach hit 89% field accuracy but gave no warning on the errors, leaving them for downstream systems and auditors to catch.

How do I extract scanned invoices accurately with Claude?

Send each PDF page to Claude as an image instead of pre-extracted text. Then prompt it to return structured JSON where every field includes a value, a bounding box (x, y, width, height), and a page number. This forces visual grounding, because Claude must commit to coordinates on the page. If it hallucinates a number, the bounding box will land in the wrong place, making errors detectable.

What is bounding box verification for LLM document extraction?

Bounding box verification is a check where, for each field Claude extracts, you crop the exact region from the page image using the returned coordinates and send that crop back to Claude asking what number it sees. If the answer matches the original extraction, the field is confirmed. If not, it is flagged for human review. This creates an auditable trail and catches hallucinations.

How accurate is Claude at extracting scanned invoices in production?

Across 2,100 real scanned Serbian supplier invoices with mixed Cyrillic and Latin text, merged cells, and faint stamps, raw PDF upload reached 71% field accuracy, pre-extracted text reached 89%, and the image-plus-bounding-box-plus-verifier approach reached 99.2%. The remaining 0.8% were automatically flagged for human review, so no incorrect values reached the accounting system.

Why does auditability matter more than raw accuracy for AI invoice extraction?

A small business pushing AI-extracted data into accounting faces audits where wrong numbers create real liability. Auditability is not extracting the right number 99% of the time; it is knowing which 1% you got wrong every time, with proof. Bounding box coordinates serve as proof of what Claude saw, and crop verification provides an audit trail. That is what separates production systems from demos.



Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
