Claude Picks the Wrong Page 34% of the Time — Here's the Fix

I asked Claude which page held the termination clause in a 47-page commercial contract. It said page 12. The clause was on page 28. I re-ran the same prompt across 50 contracts from a law office client — wrong page 34% of the time. If you're quoting page numbers back to clients from long PDFs, you are shipping bad citations right now.
The Failure Mode Nobody Benchmarks
Drop a 47-page contract into Claude. Ask: "Which page contains the termination clause?" You get a confident, specific answer: page 12. Scroll to page 12 — it's the definitions section. The actual clause sits on page 28.
The trap is that the clause text Claude quotes back is correct. Only the page number is hallucinated. So nobody double-checks. The answer reads like ground truth, gets pasted into a memo, and ends up in front of a paying client.
I ran a structured benchmark on 50 contracts (commercial leases, supplier agreements, NDAs — all between 30 and 90 pages):
| Document length | Raw PDF citation accuracy |
|---|---|
| 1–5 pages | 100% |
| 6–20 pages | ~94% |
| 21–50 pages | 71% |
| 50+ pages | 58% |
Aggregated across the full 50-doc set: 66% correct, 34% wrong. Not random wrong either — wrong in a way that looks plausible, which is worse than obviously wrong.
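For anyone reproducing this kind of measurement, here is a minimal scoring sketch. The record shape `(page_count, cited_page, true_page)`, the `bucket` boundaries, and the three sample rows are assumptions made up for illustration; they are not the benchmark data above.

```python
from collections import defaultdict

# Hypothetical records: (page_count, cited_page, true_page).
# Placeholder rows only; NOT the 50-contract benchmark data.
RESULTS = [
    (4, 2, 2),
    (47, 12, 28),
    (63, 40, 40),
]

def bucket(page_count: int) -> str:
    """Assign a document to a length bucket matching the table rows."""
    if page_count <= 5:
        return "1-5"
    if page_count <= 20:
        return "6-20"
    if page_count <= 50:
        return "21-50"
    return "50+"

def accuracy_by_bucket(results):
    """Return {bucket: fraction of citations whose page matched}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for page_count, cited, true in results:
        b = bucket(page_count)
        totals[b] += 1
        hits[b] += int(cited == true)
    return {b: hits[b] / totals[b] for b in totals}

print(accuracy_by_bucket(RESULTS))
```

The point of bucketing by length is that a single aggregate number hides where the failures cluster.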
Why Claude Loses the Page Anchor Past 20 Pages
When you upload a PDF through the standard API or web interface, Claude's parser extracts text content but does not maintain a durable page-to-text anchor map past roughly 20 pages. On short documents you never notice. On a 5-page invoice the mapping holds. On a 47-page contract, the internal correspondence between text spans and page numbers drifts, and when you ask "what page is this on?", the model falls back to inferring from document structure — headers, section numbering, table-of-contents hints.
It's not lying. It just doesn't have the anchor anymore, so it guesses. And LLMs are very good at producing confident-sounding guesses.
Most YouTube tutorials and vendor demos never hit this because they benchmark on:
- Single-page invoices
- 2–3 page receipts
- Short policy docs
The moment you move to real legal or financial documents — anything where page citations actually matter — the failure rate climbs hard.
What I tested before settling on the fix
- Asking Claude to "be careful with page numbers" — no measurable change.
- Chain-of-thought "first identify the page, then quote the text" — accuracy went up to ~72%. Still not shippable.
- Splitting the PDF and uploading pages individually — works, but kills context. Claude can't reason across the whole document.
- Preprocessing with explicit page tags — 98%. This is the one.
The Six-Line Fix
The idea is dumb-simple: stop trusting Claude's internal page mapping. Inject the page number directly into the text content as an HTML comment, so the page anchor is literally part of the prompt context the model reads.
```python
import pdfplumber

# Write each page's text under an explicit <!-- PAGE n --> tag.
with pdfplumber.open("contract.pdf") as pdf:
    with open("contract.md", "w", encoding="utf-8") as out:
        for i, page in enumerate(pdf.pages, start=1):
            out.write(f"<!-- PAGE {i} -->\n{page.extract_text() or ''}\n\n---\n\n")
```
That's it. Six lines. What you get is a markdown file that looks like:
```
<!-- PAGE 1 -->
COMMERCIAL LEASE AGREEMENT
This agreement is entered into...

---

<!-- PAGE 2 -->
1. DEFINITIONS
"Premises" shall mean...

---

<!-- PAGE 28 -->
14. TERMINATION
Either party may terminate this agreement...
```
Feed `contract.md` into Claude instead of the raw PDF. Ask the same question. Claude reads the `<!-- PAGE 28 -->` comment sitting directly above the clause text and answers page 28. Every time.
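Because the tag lives in plain text, a cited page can also be checked mechanically. The sketch below is one possible way to do that, assuming the model returns a page number plus a quoted snippet; `split_pages` and `citation_checks_out` are hypothetical helper names, not part of any library.

```python
import re

PAGE_TAG = re.compile(r"<!-- PAGE (\d+) -->")

def split_pages(tagged_md: str) -> dict[int, str]:
    """Map page number -> the text that follows its <!-- PAGE n --> tag."""
    parts = PAGE_TAG.split(tagged_md)
    # With one capture group, re.split yields [preamble, "1", text1, "2", text2, ...]
    return {int(parts[i]): parts[i + 1] for i in range(1, len(parts) - 1, 2)}

def _squash(s: str) -> str:
    """Collapse whitespace and lowercase so line wraps don't block a match."""
    return " ".join(s.split()).lower()

def citation_checks_out(tagged_md: str, page: int, quote: str) -> bool:
    """True if the quoted text really appears under the cited page tag."""
    pages = split_pages(tagged_md)
    return _squash(quote) in _squash(pages.get(page, ""))
```

A check like this only confirms that the quote sits under the cited tag; it says nothing about whether the quote is the right clause.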
Re-running the 50-contract benchmark with preprocessed markdown:
| Metric | Raw PDF | Preprocessed |
|---|---|---|
| Citation accuracy | 66% | 98% |
| Avg. tokens per doc | ~14k | ~15k |
| Preprocessing time | 0s | 0.4–1.2s |
| Paralegal verification time per doc | ~4 min | ~0 min |
The 2% that still miss are clauses that span a page boundary, where Claude picks the page where the clause starts vs. where the operative language sits. Fixable with a span-tagging variant, but it's a different problem.
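One way to at least detect the page-boundary case is to look up every page a quote touches. This is a rough sketch rather than the span-tagging variant itself: `pages_for_quote` is a hypothetical helper, it matches on exact whitespace-separated words, and it assumes the `<!-- PAGE n -->` and `---` layout produced by the preprocessor.

```python
import re

PAGE_TAG = re.compile(r"<!-- PAGE (\d+) -->")

def pages_for_quote(tagged_md: str, quote: str) -> list[int]:
    """Return every page a quote touches, in order; [] if it is not found."""
    parts = PAGE_TAG.split(tagged_md)
    words = []  # (page_number, lowercase word) in reading order
    for i in range(1, len(parts) - 1, 2):
        page = int(parts[i])
        body = parts[i + 1].replace("---", " ")  # drop the page separator
        words.extend((page, w.lower()) for w in body.split())
    target = [w.lower() for w in quote.split()]
    if not target:
        return []
    n = len(target)
    # Slide a window of n words across the whole document, ignoring page breaks.
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if [w for _, w in window] == target:
            return sorted({p for p, _ in window})
    return []
```

If the result has more than one page, the citation can be reported as a range instead of forcing a single number.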
Why HTML Comments and Not Markdown Headers
This is the part that bit me first. My initial version used:
```
## PAGE 28
[text]
```
Claude sometimes treated those headers as document structure and reordered them in summaries, or rewrote them as "Section 28" in outputs. The page anchor leaked into the model's understanding of the document hierarchy.
HTML comments solve this cleanly:
- Invisible to any markdown renderer (so if you display the doc, comments don't show)
- Visible to the model as raw text in the prompt
- Not interpreted as headings, list items, or any structural element
- Survive copy-paste and tool calls
Other format options I tested
- `[PAGE 28]` inline brackets: works, but pollutes summaries (Claude sometimes quotes the brackets).
- `<page n="28">...</page>` XML tags: works well, slightly heavier on tokens, and Claude occasionally tries to "close" the tag in output.
- HTML comments: cleanest. Zero output pollution. Recommended.
Production Hardening: Scanned Pages, OCR, and Edge Cases
The six-line version works on text-native PDFs. Real client document pipelines see worse inputs. Here's the version I actually run in production for the law office client:
```python
import pdfplumber

def preprocess_pdf(pdf_path: str, out_path: str, ocr_fallback=None):
    """Write page-tagged markdown; OCR any page that yields no text."""
    with pdfplumber.open(pdf_path) as pdf, open(out_path, "w", encoding="utf-8") as out:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            # Image-only (scanned) pages come back empty: hand them to OCR if provided.
            if not text.strip() and ocr_fallback:
                text = ocr_fallback(page.to_image(resolution=200).original)
            out.write(f"<!-- PAGE {i} -->\n{text}\n\n---\n\n")

# Usage with Tesseract fallback for scanned pages
# (tesseract_ocr is any callable that takes a PIL image and returns a string):
# preprocess_pdf("lease.pdf", "lease.md", ocr_fallback=tesseract_ocr)
```
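A cheap sanity check after preprocessing is to scan the output for page tags with no text under them, which usually means a scanned page slipped past the fallback. The sketch below assumes the `<!-- PAGE n -->` and `---` layout written above; `empty_pages` is a hypothetical helper name.

```python
import re

PAGE_TAG = re.compile(r"<!-- PAGE (\d+) -->")

def empty_pages(tagged_md: str) -> list[int]:
    """Return page numbers whose tag is followed by no extractable text."""
    parts = PAGE_TAG.split(tagged_md)
    empty = []
    for i in range(1, len(parts) - 1, 2):
        body = parts[i + 1].replace("---", "")  # ignore the page separator
        if not body.strip():
            empty.append(int(parts[i]))
    return empty
```

Any page this returns is worth routing to OCR or flagging for a human before the file goes to the model.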
Things that will break the naive version
- Scanned/rotated pages: `pdfplumber` returns empty strings. Pipe through Tesseract or a cloud OCR before tagging. Without this, you get `<!-- PAGE 14 -->` followed by nothing, and Claude assumes page 14 is blank.
- Multi-column layouts: `extract_text()` can interleave columns. For contracts with sidebar definitions, use `page.extract_text(layout=True)` or extract by bounding box.
- Embedded tables: extract separately with `page.extract_tables()` and inject as markdown tables under the same page tag.
- Form fields / fillable PDFs: `pdfplumber` skips form field values. Use `pypdf` to pull form data and append.
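For the embedded-tables case, here is a minimal sketch of the conversion step. It assumes the row format that `page.extract_tables()` returns (a list of rows, each a list of cell strings or `None`) and treats the first row as the header, which may not hold for every table; `table_to_markdown` is a hypothetical helper name.

```python
from typing import Optional, Sequence

def table_to_markdown(rows: Sequence[Sequence[Optional[str]]]) -> str:
    """Render rows (first row treated as the header) as a markdown pipe table."""
    if not rows:
        return ""
    width = max(len(r) for r in rows)

    def clean(cell: Optional[str]) -> str:
        # None marks an empty cell; newlines and pipes would break the table row.
        return " ".join((cell or "").split()).replace("|", "\\|")

    def line(cells) -> str:
        # Pad short rows so every line has the same number of columns.
        padded = list(cells) + [None] * (width - len(cells))
        return "| " + " | ".join(clean(c) for c in padded) + " |"

    header, *body = rows
    out = [line(header), "|" + "---|" * width]
    out.extend(line(r) for r in body)
    return "\n".join(out)
```

Inside the page loop, each table from `page.extract_tables()` could be passed through a helper like this and written under the same `<!-- PAGE n -->` tag as the page text.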
For the law office pipeline, the full preprocessor is about 80 lines. The core idea is still the six-line snippet — everything else is handling input quality.
What Changed for the Client
Before: paralegals manually verified every page citation Claude produced. ~4 minutes per contract, multiplied by 30–50 documents a day. That's two to three hours of senior paralegal time burned on "is this page number right?"
After: page tags are part of the prompt context. When Claude quotes a clause and cites page 28, page 28 is provably correct because the `<!-- PAGE 28 -->` anchor is sitting in the same chunk as the clause text. No verification step needed for citations, only for legal interpretation, which is what the paralegal should be doing anyway.
That's the difference between an AI demo that impresses people in a meeting and a system you can put in front of a paying client without fearing a Monday-morning email.
Why bizflowai.io helps with this
This preprocessor is one of about a dozen small, boring fixes that sit underneath the document automation pipelines I build at bizflowai.io — contract ingestion, lease abstraction, supplier-agreement review, invoice extraction. The work isn't glamorous: it's making sure citations are correct, OCR fallbacks fire on scanned pages, tables get preserved, and outputs are auditable before they ever reach a client. Most LLM document tools skip this layer because demos don't require it. Production does.
Frequently asked questions
Why does Claude hallucinate page numbers when citing long PDFs?
When you upload a PDF directly, Claude's parser extracts text but does not preserve a reliable page-to-text anchor past roughly 20 pages. Short documents work fine, but on longer files the internal page mapping drifts. When asked 'what page,' the model guesses based on document structure. In a benchmark across 50 long contracts, page citations were wrong 34% of the time, even though the quoted clause text was usually correct.
How do I fix Claude's page citation errors on long PDFs?
Preprocess the PDF into markdown before sending it to Claude. Using pdfplumber in Python, open the file, iterate over each page, and inject an explicit page tag as an HTML comment (like `<!-- PAGE 28 -->`) at the top of each page's extracted text, followed by a separator. Feed this tagged markdown to Claude instead of the raw PDF. Citations become provably accurate because the anchor is part of the text content.
Why use HTML comments instead of markdown headers for page tags?
Markdown headers like '## PAGE 28' can be interpreted by Claude as document structure, causing the model to reorder or restructure content. HTML comments (`<!-- PAGE 28 -->`) are invisible to rendering but visible to the model, so they act as a clean machine-readable anchor without interfering with how Claude parses the document's organization.
How much does PDF preprocessing improve Claude's citation accuracy?
In a benchmark of 50 long contracts, raw PDF uploads produced correct page citations only 66% of the time. After preprocessing with pdfplumber and injecting HTML comment page tags, citation accuracy jumped to 98%. The remaining 2% of errors involved clauses spanning two pages where the model picked the wrong one — a separate, fixable issue.
When should I add OCR to a PDF preprocessing pipeline?
Add an OCR step when your PDFs contain rotated or scanned pages. pdfplumber returns empty strings for image-based or rotated pages, which means those pages won't get usable text or reliable page tags. Piping such pages through OCR before tagging ensures every page contributes extractable text that Claude can anchor citations to.
Want more like this?
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.