Why Claude Misreads PDF Invoices — And the Fix That Hits 94%

Distorted PDF invoice fragments reassembling into clean structured data, showing Claude's 94% accuracy fix

If you're converting invoice PDFs to markdown before sending them to Claude, you're quietly destroying your own accuracy. I tested it on 200 real supplier invoices across three methods, and the answer every blog post on the internet recommends came in second. Here's the method that hit 94% field accuracy, the Python that powers it, and why "convert to markdown first" is bad advice the moment a table shows up.

The accounting firm that couldn't trust their own automation

A small accounting firm pinged me with a familiar problem. 400 supplier invoices a week, every layout different — scanned faxes, multi-column PDFs, Cyrillic line items, VAT codes hiding in footers. Their team was hand-typing fields into the accounting system and burning two full days a week on it.

They'd already tried the obvious things. Upload the PDF directly to ChatGPT. Upload it to Claude. Run the popular "convert to markdown first" pipeline. Everything worked on the demo invoices and broke on the production set. Claude hallucinated VAT numbers. Totals came back off by one digit. Supplier names landed in the address field.

Once you can't trust a single field, you have to verify all of them. At that point you've automated nothing — you've just added a review step on top of the typing.

So I ran a controlled test. Same 200 invoices, three methods, every extracted field scored against the ground truth: supplier name, invoice number, date, line items, VAT, total.

The numbers: raw PDF vs markdown vs image-per-page

I'll skip the suspense. Here are the field-level accuracy results across the same 200-invoice corpus:

  • Method 1 — Raw PDF upload to Claude with an extraction prompt: 71%. Fast and cheap. Falls apart on multi-column layouts and any scanned invoice.
  • Method 2 — pdfplumber → markdown → Claude (the popular advice): 78%. Better on text-heavy invoices. Noticeably worse on anything with a real table.
  • Method 3 — Render each PDF page as PNG at 200 DPI, send as image with a strict JSON schema: 94%.

Method 3 costs roughly 2x the tokens of Method 2. But you stop re-running failures and you stop manually patching fields. On 400 invoices a week, the math pays for itself inside the first hundred documents.

The interesting result isn't that images win. It's that markdown conversion only beat raw PDF by 7 points, and on table-heavy invoices it was actually worse than just shipping the PDF unmodified. The popular advice is wrong for this specific document type.

Why markdown conversion destroys invoices specifically

Markdown conversion is genuinely good advice for prose. Contracts, reports, memos, internal wikis — strip the layout noise, hand Claude clean text, get better answers.

Invoices are not prose. Invoices are geometry.

The meaning of a number on an invoice depends entirely on which column it sits in and which row aligns with which label. 127.50 is the unit price if it's in column 3. It's the line total if it's in column 5. It's the subtotal if it's in the footer block. The column position is the semantics.

When pdfplumber or markitdown flattens a table to markdown, here's what happens:

  • Columns collapse into space-separated tokens on a single line
  • Multi-row line items get reordered based on PDF draw order, not visual order
  • Footer totals end up adjacent to header metadata
  • Scanned invoices (which have no extractable text layer) return empty or garbage

The model is now being asked to reconstruct table geometry from a wall of text. It guesses. It guesses wrong often enough to break your pipeline.

Image-per-page sidesteps the whole problem. Claude's vision model sees the document the way a bookkeeper sees it: columns stay columns, the VAT row stays next to its label, stamps and signatures and weird logos all stay in place as visual context.

The 15-line rasterizer that does the actual work

Here's the core of the extraction. Rasterize each page at 200 DPI, base64-encode it, attach as an image content block. That resolution number matters — lower and small fonts blur into noise, higher and you burn tokens for no accuracy gain.

import base64, io
import fitz  # pymupdf
from anthropic import Anthropic

client = Anthropic()

def pdf_to_image_blocks(pdf_path: str, dpi: int = 200):
    doc = fitz.open(pdf_path)
    blocks = []
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)
        png_bytes = pix.tobytes("png")
        b64 = base64.standard_b64encode(png_bytes).decode("utf-8")
        blocks.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": b64,
            },
        })
    return blocks

That's it for the ingestion side. pymupdf is faster and has fewer system dependencies than pdf2image (no poppler), but either works. For a 2-page invoice at 200 DPI you're looking at roughly 300-500 KB of PNG per page, which is well inside Claude's per-request image limits.

The system prompt that locks the schema

The model call is where most people leave accuracy on the table. The prompt has to do three things: define a strict JSON schema, forbid commentary, and explicitly allow null for missing fields.

SYSTEM = """You extract structured data from supplier invoices.
Return ONLY valid JSON matching this schema, with no commentary:

{
  "supplier_name": string,
  "invoice_number": string,
  "issue_date": string (ISO 8601, YYYY-MM-DD),
  "currency": string (ISO 4217),
  "line_items": [
    {
      "description": string,
      "quantity": number,
      "unit_price": number,
      "line_total": number
    }
  ],
  "subtotal": number,
  "vat_amount": number,
  "vat_rate": number,
  "total": number
}

Rules:
- If a field is not present or not legible, return null. Do not guess.
- Numbers must be numbers, not strings. No currency symbols.
- Dates must be ISO 8601. If only month/year are visible, return null.
"""

resp = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2000,
    system=SYSTEM,
    messages=[{
        "role": "user",
        "content": pdf_to_image_blocks("invoice.pdf") + [
            {"type": "text", "text": "Extract this invoice."}
        ],
    }],
)

The single most important line in that prompt is "If a field is not present or not legible, return null. Do not guess." That one instruction cut hallucinations dramatically in my testing. Without it, the model invents plausible-looking values to fill the schema — fake VAT numbers, made-up invoice numbers, dates pulled from neighbouring text. With it, missing fields surface as null and your validator can route them for human review.

A few small things that mattered:

  • No "be careful" or "be accurate" — those don't do anything. The null instruction does.
  • Numbers as numbers, not strings. Cuts downstream parsing errors in half.
  • ISO dates only. Otherwise you'll get 15.03.2024 from European invoices and 03/15/2024 from US ones in the same field.

Wrap it in a validator or it's not a system

Demos don't need this. Production does. Parse the response with Pydantic, and route anything that fails validation or has a null in a required field to a human queue:

from pydantic import BaseModel, ValidationError
from typing import Optional
import json

class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    line_total: float

class Invoice(BaseModel):
    supplier_name: str
    invoice_number: str
    issue_date: str
    currency: str
    line_items: list[LineItem]
    subtotal: Optional[float]
    vat_amount: Optional[float]
    vat_rate: Optional[float]
    total: float

try:
    data = json.loads(resp.content[0].text)
    invoice = Invoice(**data)
    push_to_accounting(invoice)
except (ValidationError, json.JSONDecodeError) as e:
    flag_for_review(pdf_path, reason=str(e))

Add one more cross-check: sum the line_total values and compare against the extracted subtotal within a small tolerance (rounding, discount lines). If they disagree by more than, say, 1%, flag for review. This is the cheapest accuracy multiplier you'll ever ship — it catches column-confusion errors that the model can't catch itself.

The accounting firm I built this for went from two days a week of manual entry to an overnight queue that surfaces roughly 10 exceptions per morning out of 80-ish invoices. Their team reviews exceptions, bulk-approves the rest, and the system writes straight into the accounting software. That's the gap between a tutorial trick and something you can leave running while you sleep.

Why bizflowai.io helps with this

This invoice pipeline is one of the standard workflows we deploy at bizflowai.io for accounting firms and SMBs drowning in supplier paperwork. The same image-per-page extraction, schema validation, and exception-queue pattern gets wired into the client's existing accounting software (most commonly QuickBooks, Xero, or local ERPs in the EU market), so processed invoices land as draft entries ready for approval rather than as JSON sitting in a folder. The interesting work is usually not the extraction itself — it's the per-supplier quirks, the VAT rules per jurisdiction, and the routing logic for exceptions.

Frequently asked questions

What is the most accurate way to extract data from PDF invoices with Claude?

Rendering each PDF page as a PNG image at 200 DPI and sending it to Claude's vision model with a strict JSON schema achieves around 94% field accuracy. This outperforms uploading raw PDFs directly (71%) and converting PDFs to markdown first (78%). Image-per-page preserves the table geometry, columns, and label alignment that invoices depend on, which flattened text loses.

Why does markdown conversion fail for invoice extraction?

Invoices are geometry, not prose. The meaning of a number depends on which column and row it sits in. When you flatten a table to markdown using tools like pdfplumber or markitdown, columns collapse into a wall of text and the model must guess which number is the unit price versus the line total. It guesses wrong often enough to break automation pipelines.

How do I stop Claude from hallucinating values when extracting invoice fields?

Add an explicit instruction to your system prompt: if a field is not present or not legible, return null and do not guess. Combine this with a strict JSON schema defining each expected field and its type. This single instruction dramatically cuts hallucinations because the model stops inventing values just to fill required schema slots.

When should I use image-based PDF extraction versus markdown conversion?

Use image-based extraction at 200 DPI for invoices, receipts, forms, and any document where table geometry, columns, and spatial layout carry meaning. Use markdown conversion for prose-heavy documents like contracts, reports, and memos, where stripping noise helps the model read better. The trade-off is roughly double the token cost for images, offset by avoiding manual fixes.

How do I build a production-ready invoice extraction pipeline?

Use pdf2image or pymupdf to rasterize each PDF page at 200 DPI, encode pages as base64, and attach them as image content blocks in the Claude API call. Pass a system prompt with a strict JSON schema covering supplier, invoice number, ISO date, line items, VAT, and total. Wrap the output in a Pydantic validator that flags unparseable JSON or missing required fields for human review.


Want more like this?

I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.

Subscribe to bizflowai.io on YouTube — never miss a new tutorial.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.

Frequently asked questions

What is the most accurate way to extract data from PDF invoices with Claude?

Rendering each PDF page as a PNG image at 200 DPI and sending it to Claude's vision model with a strict JSON schema achieves around 94% field accuracy. This outperforms uploading raw PDFs directly (71%) and converting PDFs to markdown first (78%). Image-per-page preserves the table geometry, columns, and label alignment that invoices depend on, which flattened text loses.

Why does markdown conversion fail for invoice extraction?

Invoices are geometry, not prose. The meaning of a number depends on which column and row it sits in. When you flatten a table to markdown using tools like pdfplumber or markitdown, columns collapse into a wall of text and the model must guess which number is the unit price versus the line total. It guesses wrong often enough to break automation pipelines.

How do I stop Claude from hallucinating values when extracting invoice fields?

Add an explicit instruction to your system prompt: if a field is not present or not legible, return null and do not guess. Combine this with a strict JSON schema defining each expected field and its type. This single instruction dramatically cuts hallucinations because the model stops inventing values just to fill required schema slots.

When should I use image-based PDF extraction versus markdown conversion?

Use image-based extraction at 200 DPI for invoices, receipts, forms, and any document where table geometry, columns, and spatial layout carry meaning. Use markdown conversion for prose-heavy documents like contracts, reports, and memos, where stripping noise helps the model read better. The trade-off is roughly double the token cost for images, offset by avoiding manual fixes.

How do I build a production-ready invoice extraction pipeline?

Use pdf2image or pymupdf to rasterize each PDF page at 200 DPI, encode pages as base64, and attach them as image content blocks in the Claude API call. Pass a system prompt with a strict JSON schema covering supplier, invoice number, ISO date, line items, VAT, and total. Wrap the output in a Pydantic validator that flags unparseable JSON or missing required fields for human review.