Claude Forgets Page 12 of Your PDF — Chunk It Like This

By Lazar Milicevic · Published June 13, 2026 · 8 min read

A small invoicing client sent me a 40-page supplier reconciliation statement and asked Claude to extract every line item into clean JSON. Page one was flawless. By page fifteen, accuracy had collapsed to 43% — and Claude sounded just as confident about the wrong numbers as it did about the right ones. If you're uploading long PDFs in one shot, you're shipping silent garbage into someone's books and you don't even know it.

The failure mode nobody warns you about

Here's what actually happened on that 40-page document. I uploaded it in a single call, then cross-checked the extracted line items against the source PDF, one page range at a time.

Page range	Extraction accuracy	Failure type
1–5	91%	Minor formatting only
6–10	78%	Occasional rounded totals
11–15	43%	Merged suppliers, invented line items
16–25	47%	Hallucinated invoice numbers
26–40	51%	Tail recovery, but unreliable

Notice the lopsided U-shape: accuracy is highest at the start, bottoms out in the middle, and recovers only partly toward the end. The model pays more attention to what's near the start and end of the context window, and the middle goes fuzzy. This is positional attention decay, and it's well-documented in long-context LLMs ("lost in the middle"). It's not a Claude bug; every long-context model does this to some degree. Claude just happens to be the one most small businesses point at their PDFs.

The scary part isn't the wrong number. It's that there's no warning. No "confidence dropped here," no "I'm not sure about page 14." Just JSON that looks correct, with the wrong supplier merged into the wrong total, flowing straight into the accounting system.
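For anyone who wants to reproduce this kind of table on their own documents, per-range scoring can be sketched as below. This is a minimal sketch, not the exact harness behind the numbers above. It assumes a hand-checked ground-truth list in which every row carries a `page` field (an assumption for this sketch), and it counts a row as correct only when `invoice_no` and `amount` both match. The sample rows are invented. It measures recall only, so invented line items need a separate check.

```python
from collections import defaultdict

def accuracy_by_range(truth_rows, extracted_rows, ranges):
    """Share of ground-truth rows recovered exactly, per page range.

    truth_rows: dicts with 'page', 'invoice_no', 'amount' (hand-checked).
    extracted_rows: dicts with 'invoice_no', 'amount' (model output).
    ranges: list of (first_page, last_page) tuples, inclusive.
    """
    extracted = {(r["invoice_no"], round(r["amount"], 2)) for r in extracted_rows}
    hits, totals = defaultdict(int), defaultdict(int)
    for row in truth_rows:
        for first, last in ranges:
            if first <= row["page"] <= last:
                totals[(first, last)] += 1
                if (row["invoice_no"], round(row["amount"], 2)) in extracted:
                    hits[(first, last)] += 1
                break  # each truth row belongs to one range only
    return {rng: hits[rng] / totals[rng] for rng in ranges if totals[rng]}

# Invented sample data, for illustration only
truth = [
    {"page": 1, "invoice_no": "A-1", "amount": 100.0},
    {"page": 2, "invoice_no": "A-2", "amount": 50.0},
    {"page": 12, "invoice_no": "B-1", "amount": 75.5},
    {"page": 13, "invoice_no": "B-2", "amount": 20.0},
]
got = [
    {"invoice_no": "A-1", "amount": 100.0},
    {"invoice_no": "A-2", "amount": 50.0},
    {"invoice_no": "B-1", "amount": 80.0},  # wrong amount, counts as a miss
    {"invoice_no": "B-2", "amount": 20.0},
]
scores = accuracy_by_range(truth, got, [(1, 5), (11, 15)])
print(scores)  # {(1, 5): 1.0, (11, 15): 0.5}
```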

Why "convert to markdown first" doesn't save you

The first thing most tutorials tell you to do is convert the PDF to markdown or HTML before uploading. The theory is that cleaner text helps the model parse better. I tested it on the same document, same prompt, same model:

Raw PDF upload: 43% accuracy at page 15
PDF → markdown (via pymupdf4llm): 46% at page 15
PDF → HTML (via pdfplumber + custom layout): 44% at page 15

A 1–3 point difference. Within noise. Format conversion is a real optimization for short documents where layout confuses the parser, but it does nothing for positional decay on long ones. You're still shoving 40 pages of text into a single context window, and the middle is still going to rot.

The fix isn't a better input format. It's never sending 40 pages in one call to begin with.

The chunking pattern: 5 pages, 1 overlap, running summary

Here's the pattern I deploy for clients. Three rules:

The rules

5 pages per chunk — small enough that the middle of the chunk is still close to the edges of the context window
1 page of overlap between consecutive chunks — catches line items that span page breaks
2–3 sentence running summary from the previous chunk passed into the next one — gives continuity without bloating tokens

So for a 40-page document:

Chunk 1: pages 1–5
Chunk 2: pages 5–9
Chunk 3: pages 9–13
Chunk 4: pages 13–17
… and so on through chunk 10 (pages 37–40)
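The page ranges above can be generated rather than typed out. The helper below is a small sketch; `chunk_ranges` is a name chosen for illustration, and it returns 1-indexed, inclusive page numbers to match the list above (the skeleton later in this post uses 0-indexed offsets instead).

```python
def chunk_ranges(total_pages, chunk_size=5, overlap=1):
    """Return 1-indexed, inclusive (first_page, last_page) tuples."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    if total_pages < 1:
        return []
    ranges, first = [], 1
    while True:
        last = min(first + chunk_size - 1, total_pages)
        ranges.append((first, last))
        if last == total_pages:
            return ranges  # stop here; stepping back again would never finish
        first = last + 1 - overlap  # next chunk re-reads `overlap` pages

ranges = chunk_ranges(40)
print(len(ranges), ranges[:4], ranges[-1])
# 10 [(1, 5), (5, 9), (9, 13), (13, 17)] (37, 40)
```

The explicit stop on the final page matters: without it, subtracting the overlap after the last chunk moves the start back inside the document and the loop never ends.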

The overlap page is doing real work. Supplier statements love to break tables across page boundaries: the header row sits on page 9, the totals row sits on page 10. If chunk 3 started at page 10, you would lose the join. Because chunk 3 starts one page early, page 9 appears in both chunk 2 and chunk 3, so chunk 3 sees the header and the totals together, and your deduplication step (which I'll get to) cleans up the repeated rows using invoice number as the key.

The running summary is the part most people skip. It's not the previous chunk's full output — that defeats the purpose. It's three sentences:

"Currently processing supplier: Acme Distributing. Running subtotal: €18,420.55. Columns mapped: invoice_no, date, description, qty, unit_price, line_total."

That's enough for the model to keep state without re-reading anything.
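If you would rather not rely on the model to write that summary, one option is to build it in your own code from the rows already extracted. The sketch below shows the idea; `build_running_summary` and its parameters are hypothetical names for illustration, not part of the pipeline shown later.

```python
def build_running_summary(supplier, subtotal, columns, currency="€"):
    """Compress chunk state into three short sentences for the next prompt."""
    return (
        f"Currently processing supplier: {supplier}. "
        f"Running subtotal: {currency}{subtotal:,.2f}. "
        f"Columns mapped: {', '.join(columns)}."
    )

summary = build_running_summary(
    "Acme Distributing",
    18420.55,
    ["invoice_no", "date", "description", "qty", "unit_price", "line_total"],
)
print(summary)
```

The trade-off is that a deterministic summary can only carry what your code already tracks, while a model-written one can note oddities such as a table that continues onto the next page.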

The actual code (≈40 lines)

This is the working skeleton. Install anthropic and pypdf, set ANTHROPIC_API_KEY in your environment (the client reads it from there), point it at a PDF, and it runs.

import anthropic, json, pypdf
from pathlib import Path

client = anthropic.Anthropic()
PDF = Path("supplier_statement.pdf")
CHUNK_SIZE, OVERLAP = 5, 1

def extract_pages(pdf_path, start, end):
    reader = pypdf.PdfReader(pdf_path)
    return "\n".join(reader.pages[i].extract_text() for i in range(start, end))

def call_claude(pages_text, running_summary):
    prompt = f"""Previous chunk context: {running_summary}

Extract every line item from this supplier statement chunk as JSON.
Schema: [{{"invoice_no": str, "date": str, "supplier": str, "amount": float}}]
Return the JSON array first, with no code fences or commentary.
Then end with a single line prefixed 'SUMMARY:' holding a 2-3 sentence running summary."""
    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=4096,
        messages=[{"role": "user", "content": f"{prompt}\n\n{pages_text}"}],
    )
    text = msg.content[0].text
    json_part, _, summary = text.rpartition("SUMMARY:")
    return json.loads(json_part.strip()), summary.strip()

def run(pdf_path):
    reader = pypdf.PdfReader(pdf_path)
    total = len(reader.pages)
    all_items, summary = [], "First chunk, no prior context."
    start = 0
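    while start < total:
        end = min(start + CHUNK_SIZE, total)
        text = extract_pages(pdf_path, start, end)
        items, summary = call_claude(text, summary)
        all_items.extend(items)
        if end == total:
            break  # last chunk done; stepping back by OVERLAP here would loop forever
        start = end - OVERLAP  # re-read the last page of this chunk in the next one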
    # dedupe on invoice_no — overlap pages will repeat rows
    seen, deduped = set(), []
    for row in all_items:
        if row["invoice_no"] not in seen:
            seen.add(row["invoice_no"])
            deduped.append(row)
    return deduped

print(json.dumps(run(PDF), indent=2))
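One fragile spot in the skeleton is reply parsing: a reply can arrive wrapped in a Markdown code fence, or without the SUMMARY line, and `json.loads` will then raise. A more forgiving parser might look like the sketch below. `split_reply` is a hypothetical helper, and the two tolerated cases are assumptions about how replies can go wrong rather than an exhaustive list.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker

def split_reply(text):
    """Split a model reply into (rows, summary).

    Tolerates a Markdown fence around the JSON and a missing SUMMARY line.
    Raises json.JSONDecodeError if what remains is not valid JSON.
    """
    body, marker, summary = text.rpartition("SUMMARY:")
    if not marker:  # no summary line: treat the whole reply as the JSON body
        body, summary = text, ""
    body = body.strip()
    if body.startswith(FENCE):
        # Drop the opening fence line (the fence plus an optional language tag)
        body = body.split("\n", 1)[1] if "\n" in body else ""
        body = body.rstrip()
        if body.endswith(FENCE):  # drop the closing fence, if present
            body = body[: -len(FENCE)]
    return json.loads(body), summary.strip()

fenced = (
    FENCE + 'json\n[{"invoice_no": "A-1", "amount": 10.5}]\n' + FENCE
    + "\nSUMMARY: Supplier Acme. Subtotal 10.50."
)
rows, summary = split_reply(fenced)
print(rows, summary)
```

Swapping this into call_claude is optional. Anything it cannot parse should still fail loudly, so a bad chunk never slips through as an empty list.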

Run it on the same 40-page document that previously broke:

End-to-end accuracy: 89% across all pages
Worst chunk: 84% (chunk 6, pages 21–25 — dense multi-column layout)
Token cost: roughly 2.1× the single-shot call
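Wall-clock time: ~38 seconds for 10 sequential chunks; ~9 seconds if you parallelize with asyncio.gather (which means giving up the running summary, since each chunk's summary only exists once the previous call has finished)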

Token cost doubled. The output went from "quiet liability sitting in someone's books" to "usable." That's a trade the finance team will take every time.
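For the parallel variant, the shape of the code is roughly as follows. This is a sketch under stated assumptions: `extract_chunk` is a stand-in that sleeps instead of calling the API (in a real pipeline it would wrap an async client such as `anthropic.AsyncAnthropic`), no running summary is passed because concurrent chunks cannot wait on each other, and the concurrency cap of 4 is an arbitrary illustrative value.

```python
import asyncio

async def extract_chunk(chunk_id, pages_text):
    """Stand-in for one model call; replace the sleep with a real async request."""
    await asyncio.sleep(0.01)  # simulates network latency
    return {"chunk": chunk_id, "chars": len(pages_text)}

async def extract_all(chunks, max_concurrency=4):
    """Run all chunk calls concurrently, capped by a semaphore."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(chunk_id, text):
        async with gate:  # at most max_concurrency calls in flight
            return await extract_chunk(chunk_id, text)

    # gather returns results in input order, so they line up with chunk numbers
    return await asyncio.gather(
        *(guarded(i, text) for i, text in enumerate(chunks, start=1))
    )

results = asyncio.run(extract_all(["page text a", "page text bb", "page text ccc"]))
print(results)
```

The semaphore is there because firing every chunk at once is an easy way to hit rate limits on larger documents.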

The validation pass that makes it production-grade

Extraction at 89% is good. It is not "push to accounting on Monday morning" good. The single check that closes that gap is a totals reconciliation pass:

def validate(extracted_items, printed_grand_total, tolerance=0.01):
    computed = sum(row["amount"] for row in extracted_items)
    delta = abs(computed - printed_grand_total)
    pct = delta / printed_grand_total if printed_grand_total else 1
    return {
        "computed_total": round(computed, 2),
        "printed_total": printed_grand_total,
        "delta": round(delta, 2),
        "within_tolerance": pct <= tolerance,
        "action": "auto_import" if pct <= tolerance else "human_review",
    }

You ask Claude (or grep the last page) for the printed grand total on the document. You sum the extracted line items. If they match within 1%, the pipeline imports automatically. If they don't, the document gets flagged and sits in a queue for a human to look at.
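The "grep the last page" route can be as simple as a regular expression over the extracted text. The sketch below assumes the total is labelled "Grand total" or "Total due" and printed with a dot as the decimal separator (as in €18,420.55). Both are assumptions about the statement layout, so adjust the pattern to your documents. `find_printed_total` is a hypothetical helper name.

```python
import re

# Label, then any non-digit filler (colon, currency symbol, spaces), then the amount
TOTAL_RE = re.compile(
    r"(?:grand\s+total|total\s+due)\D*?(\d{1,3}(?:,\d{3})*\.\d{2}|\d+\.\d{2})",
    re.IGNORECASE,
)

def find_printed_total(last_page_text):
    """Return the last labelled total on the page as a float, or None if absent."""
    matches = TOTAL_RE.findall(last_page_text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))

sample = "Subtotal 1,000.00\nGrand Total: € 18,420.55\n"
print(find_printed_total(sample))  # 18420.55
```

When the function returns None, the safest default is to send the document to human review rather than guess a total.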

That's it. That's the difference between a demo and something a finance team trusts.

What this catches in practice

Missing line items: computed total comes in below the printed total, flagged
Hallucinated line items: computed total comes in above the printed total, flagged
Duplicate rows the dedup missed: same effect as hallucinations, flagged
Currency mix-ups (EUR vs USD on multi-currency statements): sum is wildly off, flagged hard

On the same 40-page document, validation caught one chunk where the model had skipped a 4-line subtable. The pipeline refused to import, a human re-ran that chunk with a tighter prompt, totals matched, done. The alternative would have been finding the discrepancy three weeks later during month-end close.

Why bizflowai.io helps with this

This exact pipeline — chunked extraction, running summaries, dedup on a stable key, totals reconciliation, human-review queue — is the kind of thing I build into client systems at bizflowai.io when they're drowning in supplier statements, contracts, or monthly reports. Most of the businesses I work with already tried "just upload it to Claude" and got burned. The working version is plumbing: chunkers, validators, retry logic, a flag queue. It's not interesting on a landing page, which is why most AI tutorials skip it and most production systems need it.

Frequently asked questions

Why does Claude hallucinate on long PDF extractions?

Long PDF extraction failures are caused by positional attention decay, not file format. Large language models pay more attention to content near the start and end of the context window, while the middle gets fuzzy. In one test on a 40-page supplier reconciliation, accuracy dropped to 43% by page 15, with the model merging suppliers and inventing line items while sounding as confident as it did on page one.

Does converting a PDF to markdown or HTML fix long-document accuracy?

No. Converting PDFs to markdown or HTML before uploading does not solve long-document extraction errors. When tested on the same 40-page document as raw PDF, markdown, and HTML, accuracy at page 15 stayed within three points (43%, 46%, and 44%). The root cause is positional attention decay inside the model's context window, which format conversion cannot address. Chunking the document is the actual fix.

How do I chunk a long PDF for Claude extraction?

Split the PDF into overlapping page ranges of five pages each with one page of overlap between chunks (pages 1-5, 5-9, 9-13, and so on). Send each chunk to Claude with the same structured extraction prompt plus a two or three sentence running summary of the previous chunk. Merge results by deduplicating on a unique key like invoice number, so rows repeated on the overlapping page are kept only once.

How much does chunking improve PDF extraction accuracy?

On a 40-page supplier reconciliation statement that previously hit 43% accuracy by page 15, an overlapping chunking pipeline produced 89% end-to-end accuracy across all pages, with the worst chunk at 84%. Token cost roughly doubled compared to a single-pass extraction, but the output became reliable enough for production use instead of being a quiet liability in accounting records.

Why add a validation pass to PDF extraction pipelines?

A validation pass compares the sum of the extracted line items against the document's printed grand total. If they do not match within a defined tolerance (1% in this pipeline), the pipeline flags the document for human review instead of pushing potentially bad data into downstream systems like accounting software. This single check is what separates a demo from a workflow a finance team can actually trust in production.
