The 3-Line Header That Took My Claude PDF Accuracy to 96%

By Lazar Milicevic · Published June 17, 2026 · 9 min read

Your Claude PDF extractor isn't broken because of bad OCR. It's broken because you're stripping the one piece of context the model actually needs to disambiguate page two. If you've ever watched line items come back with the wrong currency or columns merged into the description field, this post is for you.

I run this pipeline in production, billing real customers. Here's the exact fix, measured on 200 real supplier invoices, with the Python that ships it.

The failure mode nobody names: page-two amnesia

Almost every multi-page supplier invoice has the same structure. Page one carries the metadata: invoice number, supplier, currency, exchange rate, and the column schema for the line-item table. Pages two and three carry the rows. No header. No currency marker. No column labels. Just a table that visually continues from page one.

A human reading the printed PDF never notices. The context is right there on the previous sheet of paper. But the moment you chunk that document and send each page to Claude as its own request, page two becomes a floating table with no anchor. The model is forced to guess. And it guesses badly:

Currency flips. A €1,240.00 row gets re-typed as USD because there's no symbol on page two.
Columns swap. Without the schema, Claude reorders quantity and unit_price based on which number "looks like a price."
Description bleeds. Long descriptions on page two get merged with the quantity column because the model can't see where one field ends.

This isn't a model limitation. It's a context-stripping problem you introduced at chunk time.

Why "better OCR" is the wrong instinct

I went down the standard rabbit hole first. Here's what each "fix" actually delivered on my benchmark of 200 supplier invoices (mix of German, Serbian, and Dutch suppliers, 1–7 pages each):

Approach	Line-item accuracy	Tokens / invoice	Notes
Raw PDF upload to Claude	58%	~16,000	Worst on multi-page
pdfplumber → markdown	74%	~14,000	Hard ceiling around here
Unstructured.io pipeline	75%	~15,500	Marginal, more deps
LlamaParse	76%	~14,800	Same ceiling
pdfplumber + 3-line header per page	96%	~3,800	Same model, same prompt

The accuracy ceiling sat at ~75% no matter which extractor I bolted on. That's the tell. When swapping the parser stops moving the number, the parser isn't the bottleneck. The chunk is.
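The post doesn't spell out how line-item accuracy was scored. As a rough sketch only, one plausible definition is exact match per row: a predicted row counts as correct only if every field equals the hand-labeled row in the same position. The function name, the field names, and the sample rows below are all illustrative assumptions, not the benchmark's actual scorer.

```python
from typing import Dict, List

Row = Dict[str, str]

def line_item_accuracy(predicted: List[Row], expected: List[Row]) -> float:
    """Fraction of expected rows reproduced exactly, compared position by position."""
    if not expected:
        return 1.0 if not predicted else 0.0
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return correct / len(expected)

# Hypothetical labeled rows and model output for one invoice.
expected = [
    {"description": "Steel bolts M8", "qty": "200", "unit_price": "0.12", "currency": "EUR"},
    {"description": "Washers", "qty": "500", "unit_price": "0.03", "currency": "EUR"},
]
predicted = [
    {"description": "Steel bolts M8", "qty": "200", "unit_price": "0.12", "currency": "EUR"},
    {"description": "Washers", "qty": "500", "unit_price": "0.03", "currency": "USD"},  # currency flip
]
score = line_item_accuracy(predicted, expected)  # one of two rows is an exact match
```

Under this strict definition a single flipped currency fails the whole row, which is why page-two errors drag the number down so quickly.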

What I tried before this clicked

Larger context windows (sending the whole PDF every call). Worked, but cost was brutal and latency hit 22s on 5-page invoices.
Two-pass extraction (first pass for metadata, second pass for rows). Worked at 91% but doubled the API spend.
Fine-tuning a smaller model on invoice JSON. Two weeks of work for a 4-point accuracy gain. Not worth it.

The three-line header beat all of them on accuracy, cost, and latency. Same model, same prompt.

The fix: three lines, prepended to every page

The idea is dumb-simple. Parse the header block from page one once. Then prepend a tiny anchor to every page before you send it to Claude. Three lines:

Invoice ID + supplier — gives the model a stable identity to attach rows to.
Currency + exchange rate — locked from page one, never re-inferred.
Column schema, in order — the exact field names Claude should produce per row.

Here's the actual code. No new dependencies beyond pdfplumber and the Anthropic SDK:

import pdfplumber
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# The extract_* helpers and call_claude are not shown in this post.
# call_claude wraps the existing extraction prompt and returns a list of rows.

def parse_header(page_text: str) -> dict:
    """Run once on page 1. Pull invoice_id, currency, column order."""
    return {
        "invoice_id": extract_invoice_id(page_text),
        "supplier":   extract_supplier(page_text),
        "currency":   extract_currency(page_text),   # "EUR"
        "fx_rate":    extract_fx_rate(page_text),    # "1.000" or "117.85 RSD"
        "columns":    extract_columns(page_text),    # ["description","qty","unit_price","line_total"]
    }

def build_anchor(meta: dict) -> str:
    """Render the three anchor lines plus a separator."""
    return (
        f"INVOICE: {meta['invoice_id']} | SUPPLIER: {meta['supplier']}\n"
        f"CURRENCY: {meta['currency']} | FX_RATE: {meta['fx_rate']}\n"
        f"COLUMNS (in order): {', '.join(meta['columns'])}\n"
        f"---\n"
    )

def extract_invoice(pdf_path: str) -> list:
    with pdfplumber.open(pdf_path) as pdf:
        # extract_text() can come back empty (or None in older pdfplumber
        # versions) on image-only pages, so fall back to "".
        page1_text = pdf.pages[0].extract_text() or ""
        meta = parse_header(page1_text)
        anchor = build_anchor(meta)

        rows = []
        for page in pdf.pages:
            page_text = page.extract_text() or ""
            anchored_chunk = anchor + page_text  # same anchor on every page, page 1 included
            rows.extend(call_claude(anchored_chunk))
        return rows

That's it. The call_claude function uses the same extraction prompt I had before. I changed nothing about the prompt, the model (Sonnet), or the JSON schema.
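The extract_* helpers are not shown above. As a minimal sketch, assuming page one uses labeled fields such as "Invoice No:" and "Currency:" and prints the column labels on a single row, they could be plain regex and string lookups. The sample text, the label wording, and the KNOWN_COLUMNS mapping are hypothetical; real supplier layouts vary and will need their own patterns.

```python
import re
from typing import List, Optional

# Hypothetical page-one text, for illustration only.
SAMPLE_PAGE_ONE = """\
Invoice No: INV-2024-1840
Supplier: Example GmbH
Currency: EUR
Description Qty Unit Price Line Total
Steel bolts M8 200 0.12 24.00
"""

def extract_invoice_id(text: str) -> Optional[str]:
    m = re.search(r"Invoice\s+No[.:]?\s*(\S+)", text, re.IGNORECASE)
    return m.group(1) if m else None

def extract_currency(text: str) -> Optional[str]:
    # Expects a three-letter ISO code right after the label.
    m = re.search(r"Currency[.:]?\s*([A-Z]{3})\b", text)
    return m.group(1) if m else None

# Assumed mapping from printed column labels to the field names used in the anchor.
KNOWN_COLUMNS = {
    "description": "description",
    "qty": "qty",
    "unit price": "unit_price",
    "line total": "line_total",
}

def extract_columns(text: str) -> List[str]:
    """Return schema names in the left-to-right order they appear on the header row."""
    for line in text.splitlines():
        lowered = line.lower()
        hits = [
            (lowered.find(label), name)
            for label, name in KNOWN_COLUMNS.items()
            if label in lowered
        ]
        if len(hits) == len(KNOWN_COLUMNS):  # only accept a row that names every column
            return [name for _, name in sorted(hits)]
    return []
```

Each helper returns None or an empty list when nothing matches instead of guessing, so a bad parse is easy to detect before any page is sent.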

Why this works: it's a prompt anchor, not metadata

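The header isn't there for your debugging. It's there because Claude reads it as a constraint on every chunk: it sits at the top of the message, ahead of the page text, so the model sees the invoice identity, currency, and schema before it sees a single row.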

When page two arrives with three lines that say "this is invoice INV-2024-1840, currency EUR, columns are description / qty / unit_price / line_total," the model stops guessing. It now knows:

Every row belongs to a single, named invoice (no cross-contamination if you batch).
Every numeric field has a fixed currency. The hallucinated USD goes away.
Every row maps to a known schema, in order. Column drift disappears.
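To make that concrete, here is roughly what a headerless page two looks like once the anchor is prepended. build_anchor is repeated from the code above so the snippet runs on its own; the supplier name and the two rows are made-up sample values.

```python
def build_anchor(meta: dict) -> str:
    return (
        f"INVOICE: {meta['invoice_id']} | SUPPLIER: {meta['supplier']}\n"
        f"CURRENCY: {meta['currency']} | FX_RATE: {meta['fx_rate']}\n"
        f"COLUMNS (in order): {', '.join(meta['columns'])}\n"
        f"---\n"
    )

meta = {
    "invoice_id": "INV-2024-1840",
    "supplier": "Example GmbH",  # made-up supplier
    "currency": "EUR",
    "fx_rate": "1.000",
    "columns": ["description", "qty", "unit_price", "line_total"],
}

# Made-up page-two text: rows only, no header, no currency symbol.
page_two_text = "Hex nuts M8 300 0.05 15.00\nWashers 500 0.03 15.00"

anchored_chunk = build_anchor(meta) + page_two_text
print(anchored_chunk)
# INVOICE: INV-2024-1840 | SUPPLIER: Example GmbH
# CURRENCY: EUR | FX_RATE: 1.000
# COLUMNS (in order): description, qty, unit_price, line_total
# ---
# Hex nuts M8 300 0.05 15.00
# Washers 500 0.03 15.00
```

The bare rows are ambiguous on their own; with the four lines on top, each number has a named column and a fixed currency.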

The result on my 200-invoice benchmark:

Line-item accuracy: 74% → 96%
Tokens per invoice: ~14,000 → ~3,800 (73% drop)
End-to-end latency (3-page invoice): 14s → under 4s (with parallel page calls)

The token drop is the part most people miss. Before, I was shoving the entire document into context on every call so Claude could "see" page one when processing page two. Now each page is self-contained at ~1,200 tokens, and I can parallelize.
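The headline percentages follow directly from the figures above. A quick check, treating "under 4s" as exactly 4s so the speedup is a lower bound (that rounding is my assumption):

```python
def percent_drop(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100

token_drop = percent_drop(14_000, 3_800)  # tokens per invoice, before vs after
speedup = 14 / 4                          # 3-page latency, 14s vs "under 4s"
pages_implied = 3_800 / 1_200             # invoice tokens divided by tokens per page

print(f"{token_drop:.0f}% fewer tokens, at least {speedup:.1f}x faster")
# 73% fewer tokens, at least 3.5x faster
```

At roughly 1,200 tokens per anchored page, ~3,800 tokens per invoice corresponds to an average of about three pages.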

Parallelizing now that chunks are independent

Once every page is anchored, each page call is independent. No shared context. That means you can fan out:

import asyncio

import pdfplumber
from anthropic import AsyncAnthropic

async_client = AsyncAnthropic()

# EXTRACT_PROMPT and parse_rows are the same prompt and row parser used by
# call_claude; parse_header and build_anchor come from the block above.

async def extract_page(anchored_chunk: str):
    resp = await async_client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{"role": "user", "content": EXTRACT_PROMPT + anchored_chunk}],
    )
    return parse_rows(resp.content[0].text)

async def extract_invoice_parallel(pdf_path: str):
    with pdfplumber.open(pdf_path) as pdf:
        meta = parse_header(pdf.pages[0].extract_text() or "")
        anchor = build_anchor(meta)
        # Fall back to "" so an image-only page doesn't break the concatenation.
        chunks = [anchor + (p.extract_text() or "") for p in pdf.pages]

    # 6-way concurrency, plenty for a 3-7 page invoice
    sem = asyncio.Semaphore(6)
    async def bounded(c):
        async with sem:
            return await extract_page(c)

    # gather returns results in the same order as chunks, i.e. page order
    results = await asyncio.gather(*(bounded(c) for c in chunks))
    return [row for page_rows in results for row in page_rows]
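parse_rows is used above but not shown. A minimal sketch, assuming the extraction prompt asks Claude to reply with a JSON array of row objects, might look like the following. The reply format is an assumption on my part, and the real parser may be stricter.

```python
import json
import re
from typing import Dict, List

def parse_rows(response_text: str) -> List[Dict]:
    """Pull the first JSON array out of a model reply and return its dict entries."""
    # Take everything from the first '[' to the last ']' so any prose
    # around the JSON is ignored.
    match = re.search(r"\[.*\]", response_text, re.DOTALL)
    if match is None:
        return []
    try:
        rows = json.loads(match.group(0))
    except json.JSONDecodeError:
        return []
    # Keep only object entries; anything else is not a line item.
    return [row for row in rows if isinstance(row, dict)]

reply = 'Here are the rows:\n[{"description": "Washers", "qty": 500}]\nDone.'
rows = parse_rows(reply)
```

Returning an empty list on a malformed reply keeps the fan-out running; in production you may prefer to raise or log so a silently empty page doesn't pass as a page with no rows.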

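A 3-page invoice that used to take 14 seconds sequentially now finishes in 3.8 seconds. A 7-page invoice that timed out at 38s now lands at 6.2s. The math is straightforward: with anchored chunks, latency is dominated by your slowest page, not the sum of pages, as long as the page count fits inside the concurrency cap. With the 6-way semaphore above, a 7-page invoice has to wait for a free slot before its seventh page starts, which is part of why it takes longer than a 3-page one.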
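The sum-versus-slowest-page reasoning can be checked with a small deterministic model. This is a sketch, not a measurement: the per-page latencies are hypothetical, and the scheduler simply assumes pages start in order and a waiting page takes the first slot that frees up.

```python
import heapq
from typing import List

def sequential_latency(page_seconds: List[float]) -> float:
    """One page after another: wall time is the sum."""
    return sum(page_seconds)

def bounded_parallel_latency(page_seconds: List[float], concurrency: int) -> float:
    """Wall time when at most `concurrency` pages run at once, started in page order."""
    finish_times: List[float] = []  # min-heap of finish times for in-flight pages
    wall = 0.0
    for seconds in page_seconds:
        # If every slot is busy, this page starts when the earliest in-flight page finishes.
        start = heapq.heappop(finish_times) if len(finish_times) >= concurrency else 0.0
        end = start + seconds
        heapq.heappush(finish_times, end)
        wall = max(wall, end)
    return wall

pages = [4.0, 3.0, 5.0]  # hypothetical per-page latencies in seconds
print(sequential_latency(pages))                   # 12.0
print(bounded_parallel_latency(pages, 6))          # 5.0
print(bounded_parallel_latency([3.0] * 7, 6))      # 6.0
```

With three pages and six slots the wall time equals the slowest page. With seven equal pages and six slots, the seventh page runs in a second wave and the wall time doubles.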

Watch out for these when you parallelize

Rate limits. Six concurrent page calls per invoice × multiple invoices in flight will hit your TPM cap fast. Cap concurrent invoices, not just pages.
Order matters for line-item numbering. asyncio.gather preserves order, but if you switch to as_completed, re-sort by page index before stitching.
Header parse failures. If text extraction on page one fails, every downstream chunk is poisoned. Validate the header dict before fanning out; it's better to fail fast than to process 7 pages with currency: None.
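The three cautions above can be sketched as small guards around the fan-out. The function names, the required-field list, and the default cap of three invoices in flight are assumptions for illustration; tune the cap to your own rate limits.

```python
import asyncio
from typing import Awaitable, Callable, List

REQUIRED_FIELDS = ("invoice_id", "supplier", "currency", "fx_rate", "columns")

def validate_header(meta: dict) -> None:
    """Raise before any API call if page one did not yield a usable header."""
    missing = [field for field in REQUIRED_FIELDS if not meta.get(field)]
    if missing:
        raise ValueError(f"header parse failed, missing: {', '.join(missing)}")

async def extract_in_page_order(
    chunks: List[str],
    extract_page: Callable[[str], Awaitable[list]],
) -> List[list]:
    """Run pages with as_completed, then restore page order before stitching."""
    async def tagged(index: int, chunk: str):
        return index, await extract_page(chunk)

    finished = []
    for future in asyncio.as_completed([tagged(i, c) for i, c in enumerate(chunks)]):
        finished.append(await future)  # arrives in completion order
    finished.sort(key=lambda pair: pair[0])  # back to page order
    return [rows for _, rows in finished]

async def run_invoices(
    paths: List[str],
    extract_invoice: Callable[[str], Awaitable[list]],
    max_invoices: int = 3,
) -> list:
    """Cap how many invoices are in flight, on top of the per-invoice page cap."""
    sem = asyncio.Semaphore(max_invoices)

    async def bounded(path: str):
        async with sem:
            return await extract_invoice(path)

    return await asyncio.gather(*(bounded(p) for p in paths))
```

Calling validate_header(meta) right after parse_header means a failed page-one parse costs zero API calls instead of one per page.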

When this pattern applies (and when it doesn't)

The three-line anchor isn't invoice-specific. It works any time you have:

A multi-page document where page one establishes context for pages two-N.
A model that processes pages independently (chunking, parallel calls, RAG retrieval).
Hallucinations that look like "the model forgot what it was reading."

Concrete fits I've shipped or seen:

Document type	Anchor lines
Supplier invoices	invoice_id, currency, column schema
Bank statements	account_number, currency, statement_period
Insurance claims	claim_id, policy_number, claimant
Medical lab reports	patient_id, report_date, reference_ranges
Real estate contracts	property_id, parties, governing_law
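For the non-invoice rows in the table, the same idea can be sketched with a generic builder that renders whatever fields page one establishes. This is an illustrative variant, not the invoice code from earlier: it puts one field per line instead of two, and the function name and sample values are placeholders.

```python
from typing import Mapping

def build_generic_anchor(fields: Mapping[str, object]) -> str:
    """Render one 'KEY: value' line per field, followed by a '---' separator."""
    lines = []
    for name, value in fields.items():
        if isinstance(value, (list, tuple)):
            value = ", ".join(str(v) for v in value)  # flatten schemas and ranges
        lines.append(f"{name.upper()}: {value}")
    return "\n".join(lines) + "\n---\n"

# Bank statement example with placeholder values.
bank_anchor = build_generic_anchor({
    "account_number": "XXXX-0000",
    "currency": "EUR",
    "statement_period": "2024-01-01 to 2024-01-31",
})
print(bank_anchor)
# ACCOUNT_NUMBER: XXXX-0000
# CURRENCY: EUR
# STATEMENT_PERIOD: 2024-01-01 to 2024-01-31
# ---
```

Whatever the document type, the anchor should hold only values fixed on page one; anything that changes per page belongs in the page text, not the header.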

Where it doesn't help: single-page documents (no chunk boundary to cross), or documents where every page already repeats the header (most government forms). In those cases the anchor is redundant and you're just burning a few hundred tokens.

Why bizflowai.io helps with this

This pattern — chunk anchoring, schema locking, parallel page extraction — is one of the building blocks behind the document-processing pipelines we run at bizflowai.io for clients pulling line items out of supplier invoices, statements, and contracts at volume. The work isn't glamorous: it's measuring accuracy on real documents, finding the chunk boundary where context dies, and patching it with the smallest possible change. If you're processing hundreds of PDFs a month and your accuracy ceiling is stuck around 75%, the fix is almost never a different model.

The lesson

When an LLM gets a PDF wrong, the instinct is to blame the OCR engine, the model, or the prompt. Nine times out of ten, you stripped the context the model needed and didn't notice.

Three lines. Fifteen lines of Python. 74% → 96%. 73% fewer tokens. 3.5x faster.

You don't need a better extractor. You need a better chunk.

Frequently asked questions

Why does Claude misread multi-page PDF invoices?

Claude misreads multi-page invoices because header context only appears on page one. Pages two and three typically contain just rows with no currency marker, invoice ID, or column schema. When you chunk the PDF and send pages individually, the model has no anchor, so it guesses currencies, swaps columns, and merges fields like quantity into description.

How do I improve Claude's PDF extraction accuracy?

Prepend a 3-line context header to every page before sending it to Claude. Line one: invoice ID and supplier. Line two: currency and exchange rate from page one. Line three: the exact column schema in order. On 200 supplier invoices, this raised line-item accuracy from 74% (the pdfplumber-to-markdown baseline) to 96% using the same model and prompt.

Why doesn't switching OCR engines fix PDF extraction errors?

Switching parsers such as Unstructured or LlamaParse hits an accuracy ceiling around 75% because the parser isn't the bottleneck; the chunk is. Chunked pages lack the header context needed to disambiguate currency, columns, and invoice identity. Adding a 3-line prompt anchor solves what better extractors cannot, because it restores the context stripped during chunking.

How much does prompt anchoring reduce token costs?

Prompt anchoring dropped token cost from roughly 14,000 to 3,800 per invoice, a 73% reduction. Because each page chunk becomes self-contained with its 3-line header, you no longer need to send the full document for context. Smaller anchored chunks also enable parallel extraction, cutting a 3-page invoice's processing time from 14 seconds to under 4.

When should I use chunk anchoring vs full-document context?

Use chunk anchoring when processing multi-page PDFs where header information (currency, schema, IDs) appears only on early pages. Sending the full document wastes tokens and slows processing. Anchored chunks work better when you need parallelization, lower costs, or higher accuracy on structured documents like invoices, where each page must be interpreted with consistent reference data.
