Claude Invents Contract Clauses From PDFs — Preprocessor Fix

Last week I caught Claude fabricating a termination clause in a 52-page client agreement. The clause didn't exist. The section number it cited was real, but pointed at governing law. If you're feeding contracts, RFPs, or long reports into Claude for client work, you are almost certainly shipping hallucinations and not catching them.

Most tutorials test this on 2-page invoices where everything looks fine. Here's what actually happens on real documents — and the ~90-line preprocessor that pushed recall from 71% to 97% and cut token cost by 74%.

The failure mode: confident, well-numbered, completely invented

The setup: a client's 52-page master services agreement. Standard legal boilerplate. Six termination clauses scattered across sections 4, 9, and an appendix. I drop the PDF straight into Claude — the way every tutorial tells you to — and ask one prompt:

List every termination clause with the exact section number.

Claude returns seven clauses. One of them references section 11.3, which sounds plausible. Section 11.3 in the actual document is about governing law. The clause is fabricated — plausible language, plausible numbering, completely invented.

This isn't a one-off. Across 40 test contracts, raw PDF upload gave us:

  • Recall: 71% (29% of real clauses missed)
  • Hallucination rate: 12.4% (invented clauses or wrong sections)
  • Avg token cost per contract: $0.42

For a client intake pipeline, that's unshippable. If a paralegal misses a termination clause, somebody catches it on review. If your AI confidently invents one and your reviewer trusts it, you've shipped a liability.

Why PDF upload breaks on long documents

When you upload a PDF, Claude receives a flattened text stream. Here's what gets mashed together into one undifferentiated blob:

  • Running headers and footers on every page
  • Page numbers
  • The cover page and signature blocks
  • The table of contents (which looks structurally identical to actual section headings)
  • The clauses themselves

Section hierarchy is gone. Indentation is gone. There is no stable anchor the model can point at. So when Claude is uncertain — and on a 50-page document it gets uncertain a lot — it falls back on pattern matching. It knows what a contract usually looks like. It generates a clause that fits the pattern, attaches a section number that sounds right, and hands it back with full confidence.

The common advice is "convert to markdown." That helps on short documents. On 50 pages it still degrades, because ## headings don't give the model a stable, citeable address. Markdown got us to 83% recall. Better. Still not production-grade.

The fix that actually works: chunked HTML with anchor IDs. Give Claude something it can quote by handle.

The preprocessor pipeline

The whole thing is about 90 lines of Python: pdfplumber, BeautifulSoup, and a small heuristics module for heading detection. Runs in under 4 seconds on a 50-page contract.

The four steps

  • Extract with bounding boxes: pdfplumber reads page by page and preserves font size and position, so heading detection isn't guesswork
  • Strip boilerplate — running headers, page numbers, signature blocks, cover page (removes ~30% of tokens before Claude ever sees the doc)
  • Wrap each section in an anchored <div> — section 4.2 becomes <div id="section-4-2">, sub-clauses get nested IDs
  • Chunk by section, not by token count — Claude always sees a complete clause with its anchor, never a clause split across two chunks
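
As a rough illustration of the fourth step, here is a minimal sketch of section-level chunking. It assumes each section is a dict carrying its rendered `html` string. The `chunk_sections` name, the `max_chars` budget, and the use of character count as a stand-in for tokens are hypothetical choices for this example, not the original pipeline's code.

```python
def chunk_sections(sections, max_chars=12000):
    """Pack whole sections into chunks; never split a section across chunks."""
    chunks, current, current_len = [], [], 0
    for sec in sections:
        size = len(sec["html"])
        # Close the current chunk when adding this section would exceed the budget.
        if current and current_len + size > max_chars:
            chunks.append(current)
            current, current_len = [], 0
        # An oversized section still goes in whole, as a chunk of its own.
        current.append(sec)
        current_len += size
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that the budget is soft: a section larger than the budget is kept intact rather than cut, because a clause separated from its anchor is the failure this step exists to prevent.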

Here's the heading-detection core. Sizes will vary per template, so the heuristic is "anything meaningfully larger than body text, or anything bold that starts with a section number":

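import re
from statistics import median

import pdfplumber
from bs4 import BeautifulSoup

# Matches numbered headings such as "4.2 Termination for Convenience".
SECTION_RE = re.compile(r'^(\d+(?:\.\d+)*)\s+(.+)$')

def median_font_size(pdf, sample_pages=5):
    # Helper not shown in the original listing; assumed here to be the
    # median character size over the first few pages.
    sizes = [c["size"] for page in pdf.pages[:sample_pages] for c in page.chars]
    return median(sizes) if sizes else 0.0

def extract_blocks(pdf_path):
    blocks = []
    with pdfplumber.open(pdf_path) as pdf:
        body_size = median_font_size(pdf)  # established from a sample of pages
        for page in pdf.pages:
            for word in page.extract_words(extra_attrs=["size", "fontname"]):
                blocks.append({
                    "text": word["text"],
                    "size": round(word["size"], 1),
                    "bold": "Bold" in word["fontname"],
                    "page": page.page_number,
                })
    return blocks, body_size

def is_heading(line, body_size):
    # Expects a line-level block: merge word blocks into lines first,
    # otherwise SECTION_RE never sees the title that follows the number.
    if line["size"] >= body_size + 1.5:
        return True
    if line["bold"] and SECTION_RE.match(line["text"]):
        return True
    return False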
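
The heading check takes a whole line, while `extract_words` yields one block per word, so the words need to be merged somewhere in between. That step isn't shown, so the following is only a sketch of one way to do it. It assumes each word block also carries the `top` and `x0` coordinates that pdfplumber word dicts expose; the `words_to_lines` name and the `y_tol` tolerance are hypothetical.

```python
def words_to_lines(words, y_tol=2.0):
    """Merge word blocks on the same page and roughly the same baseline."""
    lines = []
    for w in sorted(words, key=lambda w: (w["page"], w["top"])):
        last = lines[-1] if lines else None
        if last and last["page"] == w["page"] and abs(w["top"] - last["top"]) <= y_tol:
            last["words"].append(w)
        else:
            lines.append({"page": w["page"], "top": w["top"], "words": [w]})

    merged = []
    for line in lines:
        # Restore left-to-right reading order within the line.
        ws = sorted(line["words"], key=lambda w: w["x0"])
        merged.append({
            "text": " ".join(w["text"] for w in ws),
            "size": max(w["size"] for w in ws),
            "bold": all(w["bold"] for w in ws),  # bold only if every word is bold
            "page": line["page"],
        })
    return merged
```

Each merged dict has the `text`, `size`, and `bold` keys that `is_heading` reads, so a numbered heading such as "4.2 Termination" reaches the regex as a single string.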

And the HTML assembly — this is the part that actually fixes Claude's behavior:

def to_anchored_html(sections):
    soup = BeautifulSoup("<html><body></body></html>", "html.parser")
    body = soup.body
    for sec in sections:
        sec_id = "section-" + sec["number"].replace(".", "-")
        div = soup.new_tag("div", id=sec_id)
        h = soup.new_tag("h2")
        h.string = f'{sec["number"]} {sec["title"]}'
        div.append(h)
        p = soup.new_tag("p")
        p.string = sec["body"]
        div.append(p)
        body.append(div)
    return str(soup)
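
The `sections` list that `to_anchored_html` consumes has to be assembled from the detected headings, and that step isn't shown. A minimal sketch, assuming each line block has already been tagged with a `heading` flag (for example by the heading heuristic); the `build_sections` name and the `heading` key are hypothetical:

```python
import re

SECTION_RE = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")

def build_sections(lines):
    """Fold line blocks into {"number", "title", "body"} dicts."""
    sections, current = [], None
    for line in lines:
        m = SECTION_RE.match(line["text"]) if line["heading"] else None
        if m:
            # A numbered heading opens a new section.
            current = {"number": m.group(1), "title": m.group(2), "body": []}
            sections.append(current)
        elif current is not None:
            # Body text (and any unnumbered heading) joins the open section.
            current["body"].append(line["text"])
        # Lines before the first numbered heading are dropped in this sketch.
    for sec in sections:
        sec["body"] = " ".join(sec["body"])
    return sections
```

This sketch only handles purely numeric section numbers. Appendix-style headings would need their own pattern.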

Boilerplate stripping is mostly negative filters: drop the top and bottom 6% of each page (kills running headers/footers on most templates), drop any line matching ^Page \d+ of \d+$, drop the cover page, drop the signature block (detected by the cluster of ____________ underscores).
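
Those filters can be sketched roughly as follows. This assumes line blocks that carry `page`, `top`, and `bottom` coordinates plus the page height (pdfplumber exposes these, but the dict shape here is an assumption). The `strip_boilerplate` name and the six-underscore threshold are hypothetical, and for brevity the sketch drops only the underscore lines themselves rather than the whole signature block.

```python
import re

PAGE_NUM_RE = re.compile(r"^Page \d+ of \d+$")
SIGNATURE_RE = re.compile(r"_{6,}")  # assumed threshold for a signature rule

def strip_boilerplate(lines, page_height, margin=0.06, cover_page=1):
    """Keep only the lines that pass every negative filter."""
    kept = []
    for line in lines:
        text = line["text"].strip()
        if line["page"] == cover_page:
            continue  # cover page
        if line["top"] < page_height * margin:
            continue  # running header band
        if line["bottom"] > page_height * (1 - margin):
            continue  # running footer band
        if PAGE_NUM_RE.match(text):
            continue  # "Page 3 of 52"
        if SIGNATURE_RE.search(text):
            continue  # signature rule
        kept.append(line)
    return kept
```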

The results on the same 40 contracts

Same prompt. Same model. Same documents. Only the input format changed.

| Metric | Raw PDF | Markdown | Anchored HTML |
| --- | --- | --- | --- |
| Recall | 71% | 83% | 97% |
| Hallucination rate | 12.4% | 6.1% | <2% |
| Avg cost / contract | $0.42 | $0.31 | $0.11 |
| Preprocess time | 0s | ~2s | ~3.8s |

Re-running the original prompt on the preprocessed 52-page agreement, Claude returns six clauses — the correct number — and every one cites a real anchor: section-4-2, section-9-1, appendix-b-3. I can grep the source HTML and confirm each citation exists. That's the part that matters: the output is verifiable without re-reading 50 pages.
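
The grep step can be automated. A minimal sketch, assuming the model's answer quotes anchors in the `section-4-2` / `appendix-b-3` form and the HTML uses double-quoted `id` attributes; the `check_citations` name and both regexes are hypothetical:

```python
import re

ID_RE = re.compile(r'id="([^"]+)"')
CITE_RE = re.compile(r"\b(?:section|appendix)-[a-z0-9]+(?:-[a-z0-9]+)*\b")

def check_citations(answer, html):
    """Split the anchors cited in the model's answer into (found, missing)."""
    known = set(ID_RE.findall(html))
    cited = sorted(set(CITE_RE.findall(answer)))
    found = [c for c in cited if c in known]
    missing = [c for c in cited if c not in known]
    return found, missing
```

Anything that lands in `missing` is a citation with no matching anchor in the source, which can be flagged for review before a human ever reads the answer.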

On a pipeline processing 30 contracts a day, $0.42 → $0.11 is real money. About $280/month at that volume, and that's before you count the cost of human review time spent catching hallucinations that don't exist anymore.
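
The arithmetic behind that figure, as a quick check. The 30-day month is an assumption, and `monthly_savings` is a hypothetical helper name:

```python
def monthly_savings(cost_before, cost_after, per_day, days=30):
    """Return (monthly savings in dollars, fractional cost reduction)."""
    saved_per_contract = cost_before - cost_after
    return saved_per_contract * per_day * days, saved_per_contract / cost_before

savings, reduction = monthly_savings(0.42, 0.11, per_day=30)
print(f"${savings:.0f}/month, {reduction:.0%} cheaper per contract")
# prints "$279/month, 74% cheaper per contract"
```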

Why this works (and what it means for everything else)

Claude isn't broken. It's doing exactly what it should with the input it receives. The problem is the input.

Upload garbage structure, get garbage citations. Give it anchored, hierarchical HTML, and it behaves like a careful junior associate who quotes the section before answering. The anchor IDs do two things: they give the model a stable address to point at when it's certain, and they make uncertainty visible because there's no fallback pattern to fabricate against. Fabricating a section ID that has to exist verbatim in the input is much harder than guessing a section number that merely sounds right.

This generalizes. We have the same preprocessor sitting in front of every long-document workflow we ship:

  • Contract intake
  • RFP analysis
  • Policy review
  • Vendor agreement comparison
  • Insurance claim docs

Anything over 10 pages goes through it before it ever touches Claude. The rule on my machine is simple: if the document has section numbers in it, the LLM should never see it without anchors.

Why bizflowai.io helps with this

Long-document workflows are one of the most common things we automate for clients — contract intake, RFP triage, vendor comparison. The PDF-to-anchored-HTML preprocessor I described above is the same pattern that runs in front of those production pipelines, paired with section-level chunking, retrieval, and verified citations so the human reviewer can click straight to the source. That's the difference between an AI demo and something a small team can actually trust in a billing workflow.

Frequently asked questions

Why does uploading a PDF directly to Claude cause hallucinated citations?

When you upload a PDF, Claude receives a flattened text stream where headers, footers, page numbers, signature blocks, and the table of contents get mashed in with actual content. Section hierarchy and indentation are lost, leaving no stable anchor to cite. When uncertain, Claude pattern-matches what a contract usually looks like and generates plausible but fabricated clauses with invented section numbers.

How do I preprocess long PDFs so Claude cites sections accurately?

Use a four-step pipeline: extract text with pdfplumber while preserving bounding boxes to detect headings by font size; strip boilerplate like running headers, page numbers, and signature blocks; wrap each section in HTML divs with anchor IDs (e.g., div id="section-4-2") using BeautifulSoup; then chunk by section rather than token count so Claude always sees complete clauses with anchors.

What recall and hallucination rates does chunked HTML preprocessing achieve versus raw PDF upload?

Across 40 test contracts, raw PDF upload produced 71% recall with a hallucination rate above 12%. Converting to markdown improved recall to 83% but remained unreliable on 50-page documents. Chunked HTML with anchor IDs reached 97% recall and dropped hallucinations to under 2%, because each cited anchor ID can be grep'd and verified against the source HTML.

When should I use HTML preprocessing instead of markdown conversion for Claude?

Markdown conversion works for short documents but degrades on long ones because markdown headings don't give the model a stable address to cite. Use chunked HTML with anchor IDs for any document over 10 pages, especially contracts, RFPs, policy reviews, and vendor agreements where accurate section-level citation is required for production workflows.

How much does HTML preprocessing reduce token costs for long-document workflows?

Stripping boilerplate—running headers, page numbers, signature blocks, and cover pages—removes roughly 30% of the tokens in a typical contract. In one pipeline, token cost per contract dropped from 42 cents to 11 cents. At 30 contracts per day, that's a meaningful savings, and the preprocessor itself runs in under four seconds on a 50-page contract using about 90 lines of Python.


Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
