Claude Hallucinates PDF Sections 41% — Send JSON Instead

You drop a 38-page supplier catalog into Claude, ask which category product X-204 belongs to, and get a confident wrong answer. Run a hundred lookups and 41 of them are wrong. That's not a model failure — it's a structure failure, and almost nobody building agent pipelines talks about it.

I run this preprocessing pipeline in production across roughly 1,200 documents a month. Here's the exact fix that took us from 59% accuracy to 97%, and dropped cost-per-query by 6.2x.

Why the PDF parser kills your agent

When you upload a native PDF to Claude, the parser does extract the text. That part works fine. What it does not preserve is the document hierarchy — H1, H2, H3, the parent-child relationship between a category heading and the rows beneath it. All of that gets flattened into one long token stream.

To Claude, the heading Industrial Fasteners and the heading Garden Tools look like the same kind of line as the product rows underneath them. There's no structural signal that says "everything below this heading until the next heading of equal or greater weight belongs to this category."

So when you ask which category X-204 belongs to, the model guesses based on proximity in the token stream. Sometimes proximity matches reality. Often it doesn't, especially across:

  • Page breaks where the category heading is on page 7 and the products spill onto page 8 with a footer in between
  • Multi-column layouts where the parser reads top-to-bottom in column 1, then jumps back up to column 2
  • Tables that get linearized into rows with no indication of which header column the cell belonged to

Markdown conversion helps a little — at least the ## markers survive. But for an agent pipeline that has to retrieve, cite, and act on a specific section, markdown is still ambiguous. The agent has no clean way to say "give me everything under section 4.2 and only that."

Step 1: Extract blocks with font metadata, not text

Stop sending the raw PDF. Pre-process it with a small Python script. I use pdfplumber because it returns every word with its font size and position, which is the cheapest possible signal for heading detection.

The rule: body text is the most common font size. Among the rarer, larger sizes, the largest is H1 and the next tier is H2. Twenty lines of logic get you 90% of the way there for any reasonably formatted business document.

import pdfplumber
from collections import Counter

def extract_blocks(pdf_path):
    blocks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_num, page in enumerate(pdf.pages, start=1):
            for char_group in page.extract_words(extra_attrs=["size", "fontname"]):
                blocks.append({
                    "text": char_group["text"],
                    "size": round(char_group["size"], 1),
                    "page": page_num,
                    "x0": char_group["x0"],
                    "top": char_group["top"],
                })
    return blocks

def detect_heading_tiers(blocks):
    sizes = Counter(b["size"] for b in blocks)
    # Body text is the most common size. Headings are larger and rarer.
    body_size = sizes.most_common(1)[0][0]
    heading_sizes = sorted(
        [s for s in sizes if s > body_size and sizes[s] < sizes[body_size] * 0.1],
        reverse=True,
    )
    return {size: f"H{i+1}" for i, size in enumerate(heading_sizes[:3])}

That's the whole heuristic. You're not training anything. You're using the typesetting decisions the document author already made.
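The functions above return word-level blocks and a size-to-tier map, but the next step needs whole lines tagged as heading or body. A minimal sketch of that bridge follows. The helper names `group_into_lines` and `tag_lines` and the `y_tolerance` value are my own assumptions, not part of pdfplumber, and banding on the rounded `top` coordinate is a rough heuristic that can split a line whose words straddle a band boundary.

```python
from collections import Counter, defaultdict

def group_into_lines(blocks, y_tolerance=2.0):
    """Merge word-level blocks into lines, keyed by page and vertical band."""
    buckets = defaultdict(list)
    for b in blocks:
        # Words whose 'top' rounds to the same band are treated as one line.
        key = (b["page"], round(b["top"] / y_tolerance))
        buckets[key].append(b)

    lines = []
    for (page, _band), words in sorted(buckets.items(), key=lambda kv: kv[0]):
        words.sort(key=lambda w: w["x0"])  # left-to-right reading order
        lines.append({
            "text": " ".join(w["text"] for w in words),
            # The dominant font size on the line decides its tier.
            "size": Counter(w["size"] for w in words).most_common(1)[0][0],
            "page": page,
        })
    return lines

def tag_lines(lines, tiers):
    """Attach 'H1'/'H2'/'H3' or 'body' to each line using the size-to-tier map."""
    return [{**ln, "tier": tiers.get(ln["size"], "body")} for ln in lines]
```

Typical use would be `tag_lines(group_into_lines(blocks), detect_heading_tiers(blocks))`. Multi-column pages would need the words split by column before grouping, which this sketch does not attempt.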

Step 2: Emit structured JSON with a parent chain

Once you have blocks tagged by tier, walk them in order and build a tree. Every section becomes a JSON object with five fields:

  • section_id: a stable slug like sec_4_2_stainless
  • parent_id: points to the enclosing section, or null for root
  • page: where it lives in the source PDF (for human verification)
  • title: the heading text
  • content: the actual text under this heading, up to the next heading of equal or greater weight

[
  {
    "section_id": "sec_4_industrial",
    "parent_id": null,
    "page": 12,
    "title": "Industrial Fasteners",
    "content": "All-purpose fasteners for structural applications..."
  },
  {
    "section_id": "sec_4_2_stainless",
    "parent_id": "sec_4_industrial",
    "page": 14,
    "title": "Stainless Steel (304/316)",
    "content": "Product X-204 — M8x40 hex bolt, 316 grade...\nProduct X-205 — M10x50..."
  }
]

Now the category Industrial Fasteners is a node. Each subcategory beneath it is a child node that names it as its parent, and every product row sits in the content of exactly one node. The relationship is explicit, not inferred from token distance. That single change is what removes most of the hallucination.
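The walk that turns tagged lines into these objects can be sketched with a stack of open sections. This is a sketch under assumptions: the input is a list of dicts with `text`, `tier` ("H1", "H2", "H3" or "body") and `page`, and the id scheme (running numbers per level plus a slug of the title) is my own, so the ids it produces will not match the example ids above character for character.

```python
import re

def slugify(text):
    """Lowercase the title and collapse non-alphanumerics into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")

def build_sections(tagged_lines):
    """Walk lines in reading order and emit flat section dicts with parent links."""
    sections = []
    stack = []     # open sections, outermost first: [(level, section), ...]
    counters = []  # running number per heading level, e.g. [4, 2] for "4.2"

    for ln in tagged_lines:
        if ln["tier"] == "body":
            if stack:  # body text before the first heading is dropped
                sec = stack[-1][1]
                sec["content"] += ("\n" if sec["content"] else "") + ln["text"]
            continue

        level = int(ln["tier"][1:])
        # A heading closes every open section of equal or greater weight.
        while stack and stack[-1][0] >= level:
            stack.pop()

        del counters[level:]            # reset numbering below this level
        while len(counters) < level:    # pad if a level was skipped
            counters.append(0)
        counters[level - 1] += 1

        number = "_".join(str(n) for n in counters)
        section = {
            "section_id": f"sec_{number}_{slugify(ln['text'])}",
            "parent_id": stack[-1][1]["section_id"] if stack else None,
            "page": ln["page"],
            "title": ln["text"],
            "content": "",
        }
        sections.append(section)
        stack.append((level, section))
    return sections
```

Because the ids depend only on heading order and title text, reprocessing an unchanged document yields the same ids. Inserting or renaming a heading shifts them, so a real pipeline may want a sturdier scheme.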

What makes this representation work

  • Section IDs are stable across reprocessing, so your indexes don't churn
  • Parent IDs let you reconstruct the full breadcrumb for any chunk in one SQL query
  • Page numbers let a human auditor verify any answer in three seconds
  • Content is bounded — no chunk accidentally spans two sections
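To make the breadcrumb point concrete, here is one way the single query could look, assuming the sections sit in a SQLite table named `sections` with the same five fields as the JSON objects. The `breadcrumb` helper and the in-memory demo rows are illustrative only.

```python
import sqlite3

BREADCRUMB_SQL = """
WITH RECURSIVE chain(section_id, parent_id, title, depth) AS (
    SELECT section_id, parent_id, title, 0
    FROM sections WHERE section_id = ?
    UNION ALL
    SELECT s.section_id, s.parent_id, s.title, c.depth + 1
    FROM sections AS s JOIN chain AS c ON s.section_id = c.parent_id
)
SELECT title FROM chain ORDER BY depth DESC
"""

def breadcrumb(db, section_id):
    """Return 'Root > ... > Section' by following parent_id upward."""
    rows = db.execute(BREADCRUMB_SQL, (section_id,)).fetchall()
    return " > ".join(row[0] for row in rows)

# Demo data mirroring the two example sections.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE sections (section_id TEXT PRIMARY KEY, parent_id TEXT, "
    "page INTEGER, title TEXT, content TEXT)"
)
db.executemany(
    "INSERT INTO sections VALUES (?, ?, ?, ?, ?)",
    [
        ("sec_4_industrial", None, 12, "Industrial Fasteners", "..."),
        ("sec_4_2_stainless", "sec_4_industrial", 14, "Stainless Steel (304/316)", "..."),
    ],
)
print(breadcrumb(db, "sec_4_2_stainless"))
# Industrial Fasteners > Stainless Steel (304/316)
```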

Step 3: Index it where you already store things

Don't overcomplicate this. A SQLite table works for most clients I've shipped this for. The schema is five columns and one index:

CREATE TABLE sections (
    section_id TEXT PRIMARY KEY,
    parent_id  TEXT REFERENCES sections(section_id),
    page       INTEGER,
    title      TEXT,
    content    TEXT
);
CREATE INDEX idx_parent ON sections(parent_id);

If you need semantic search, push the same JSON into a vector store and store section_id and parent_id as metadata fields on every chunk. The point is: each chunk knows where it lives in the tree. Retrieval can now answer "give me section 4.2 and all its children" with a single recursive query instead of hoping the model reads context in the right order.
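The "section 4.2 and all its children" retrieval could be sketched as below, using Python's built-in sqlite3 module and the schema above. `load_sections` and `fetch_subtree` are hypothetical helper names, and the three demo rows are made up for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS sections (
    section_id TEXT PRIMARY KEY,
    parent_id  TEXT REFERENCES sections(section_id),
    page       INTEGER,
    title      TEXT,
    content    TEXT
);
CREATE INDEX IF NOT EXISTS idx_parent ON sections(parent_id);
"""

SUBTREE_SQL = """
WITH RECURSIVE subtree(section_id) AS (
    SELECT section_id FROM sections WHERE section_id = ?
    UNION ALL
    SELECT s.section_id
    FROM sections AS s JOIN subtree AS t ON s.parent_id = t.section_id
)
SELECT section_id, title, page FROM sections
WHERE section_id IN (SELECT section_id FROM subtree)
ORDER BY page, section_id
"""

def load_sections(db, sections):
    """Create the table if needed and upsert a list of section dicts."""
    db.executescript(SCHEMA)
    db.executemany(
        "INSERT OR REPLACE INTO sections "
        "VALUES (:section_id, :parent_id, :page, :title, :content)",
        sections,
    )
    db.commit()

def fetch_subtree(db, root_id):
    """Return the root section and every descendant, in page order."""
    return db.execute(SUBTREE_SQL, (root_id,)).fetchall()

db = sqlite3.connect(":memory:")
load_sections(db, [
    {"section_id": "sec_4_industrial", "parent_id": None, "page": 12,
     "title": "Industrial Fasteners", "content": "..."},
    {"section_id": "sec_4_2_stainless", "parent_id": "sec_4_industrial", "page": 14,
     "title": "Stainless Steel (304/316)", "content": "..."},
    {"section_id": "sec_5_garden", "parent_id": None, "page": 20,
     "title": "Garden Tools", "content": "..."},
])
rows = fetch_subtree(db, "sec_4_industrial")
```

The `idx_parent` index is what keeps the recursive step cheap, since each iteration looks rows up by `parent_id`.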

Step 4: Send sections, not documents, to Claude

This is where the cost and accuracy numbers both move. Instead of stuffing the whole 38-page document into context, your agent retrieves the relevant section by section_id and sends only that. Then it asks Claude to cite the section_id back in its answer.

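def answer_with_citation(question, db, semantic_lookup, ask_claude):
    """Answer from one retrieved section and ask for its section_id back.

    db must be opened with db.row_factory = sqlite3.Row so rows support
    name-based access. semantic_lookup and ask_claude are placeholders for
    your own retrieval function and your own Claude client wrapper.
    """
    # Step A: find the candidate section
    candidate = semantic_lookup(question, top_k=1)
    section = db.execute(
        "SELECT section_id, title, content, parent_id FROM sections WHERE section_id = ?",
        (candidate["section_id"],),
    ).fetchone()
    if section is None:
        return None  # retrieval pointed at an unknown section; caller should retry

    # Step B: include parent context for hierarchy questions.
    # A NULL parent_id matches no row, so parent is None for root sections.
    parent = db.execute(
        "SELECT title FROM sections WHERE section_id = ?",
        (section["parent_id"],),
    ).fetchone()

    prompt = f"""You are answering from a structured document.
Parent section: {parent['title'] if parent else 'ROOT'}
Section: {section['title']} (id: {section['section_id']})
Content:
{section['content']}

Question: {question}

Respond in JSON: {{"answer": ..., "section_id": "..."}}.
If the answer is not in this section, return section_id: null."""

    return ask_claude(prompt)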

Roughly 800 tokens per query instead of 5,000. Claude returns the section_id as a citation, so you can verify exactly where the answer came from — and if the model says section_id: null, your agent knows to retry retrieval rather than fabricate.
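The retry-on-null behaviour can be sketched as a small validation loop. Everything here is an assumption layered on the prompt format above: `ask_section` stands in for whatever function sends one section to Claude and returns the raw reply text, and the check treats any reply that is not clean JSON citing the expected section as a miss.

```python
import json

def parse_cited_answer(raw_text, expected_section_id):
    """Return (answer, ok). ok is False when the reply is malformed,
    reports section_id null, or cites a different section."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        return None, False
    if not isinstance(data, dict):
        return None, False
    if data.get("section_id") != expected_section_id:
        return None, False
    return data.get("answer"), True

def answer_with_retry(question, candidate_ids, ask_section):
    """Try candidate sections in rank order until one yields a cited answer."""
    for section_id in candidate_ids:
        raw = ask_section(question, section_id)
        answer, ok = parse_cited_answer(raw, section_id)
        if ok:
            return {"answer": answer, "section_id": section_id}
    # No candidate produced a verifiable citation: report that, do not guess.
    return {"answer": None, "section_id": None}
```

In practice a model may wrap the JSON in extra prose, so a production version would likely need a more forgiving parser than a bare `json.loads`.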

Real numbers from the same catalog

Same 38-page catalog. Same 100 category lookups. Same Claude model. The only thing that changed is what we put in the context window:

Approach                            | Correct | Wrong | Avg tokens/query | Cost per 1k queries
Native PDF upload                   | 59      | 41    | ~5,000           | ~$40
Markdown conversion                 | 78      | 22    | ~4,200           | ~$33
Structured JSON + section retrieval | 97      | 3     | ~800             | ~$6

The three remaining failures in the JSON pipeline were genuinely ambiguous entries in the source document — products that were physically listed under two categories. That's a source data problem, not a model problem.

The cost delta is the part that compounds. At 1,000 queries a month it's $40 vs $6. At the volume some of my clients run — tens of thousands of document queries a month across supplier catalogs, contracts, and compliance manuals — that's the difference between a viable margin and a project that bleeds money on inference.
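The arithmetic behind the cost column is easy to reproduce. The rate of $8 per million tokens used below is not a published price. It is back-solved from the table ($40 for 1,000 queries at roughly 5,000 tokens each) and should be replaced with whatever blended rate applies to your model and input/output mix.

```python
def cost_per_1k_queries(avg_tokens_per_query, usd_per_million_tokens):
    """Cost in USD of 1,000 queries at a flat per-token rate."""
    total_tokens = avg_tokens_per_query * 1_000
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Assumed blended rate, derived from the table rather than from a price list.
ASSUMED_RATE = 8.0

approaches = {
    "Native PDF upload": 5_000,
    "Markdown conversion": 4_200,
    "Structured JSON + section retrieval": 800,
}
for name, tokens in approaches.items():
    print(f"{name}: ${cost_per_1k_queries(tokens, ASSUMED_RATE):.2f} per 1k queries")

# Token reduction between the first and last approach.
token_ratio = approaches["Native PDF upload"] / approaches["Structured JSON + section retrieval"]
print(f"Token ratio: {token_ratio:.2f}x")
```

At that rate the three rows come out at $40.00, $33.60 and $6.40, and the token ratio is 6.25x, which is where a cost reduction of roughly 6.2x comes from.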

Where else this pattern earns its keep

The hierarchy-as-data pattern generalizes to any document where the location of a fact matters as much as the fact itself:

  • Contracts with numbered clauses — agent needs to cite "Section 7.3(b)" not just paraphrase
  • Compliance manuals with nested policies — child policy inherits constraints from parent
  • Technical spec sheets — feature belongs to a specific product variant, not the whole product line
  • Insurance policy documents — coverage and exclusions live in different sections that reference each other

Any time an agent needs to know which section a fact came from, you need that hierarchy preserved as structured data — not flattened into prose and hoped for.

Why bizflowai.io helps with this

This is the kind of plumbing we ship for clients who run document-heavy operations through bizflowai.io: supplier catalog lookups, contract Q&A, spec-sheet routing. The preprocessing pipeline, the SQLite or vector index, the retrieval-with-citation agent loop. It's not glamorous work, but it's the difference between an agent that confidently lies 41% of the time and one that's accurate enough to run autonomously in a business workflow.

Frequently asked questions

Why does Claude give wrong answers when reading PDFs?

When you upload a native PDF to Claude, the parser extracts text but flattens the document hierarchy. Headings like H1, H2, and H3, plus parent-child relationships between categories and their products, become one long token stream. Claude then guesses section membership based on token proximity, which fails across page breaks, footers, and multi-column layouts. In one test, this produced 41% wrong answers on category lookups.

How do I preserve PDF structure for Claude?

Pre-process the PDF with a Python script using pdfplumber to extract text block by block with font size and position metadata. Use font size as a heading-level signal (largest = H1). Emit structured JSON where each section has a section_id, parent_id, page number, and content. Index that JSON in SQLite or a vector store, then have your agent retrieve only relevant sections by section_id instead of sending the full document.

What accuracy improvement does structured JSON give over native PDF?

On the same 38-page catalog and 100 category lookups, native PDF upload produced 59 correct and 41 wrong answers. A structured JSON pipeline with section IDs produced 97 correct and only 3 wrong, and those three failures were genuinely ambiguous entries in the source document rather than model errors. Cost per query also dropped 6.2x because fewer tokens were sent per request.

Why is markdown conversion not enough for agent pipelines?

Markdown conversion keeps headings visible, which helps somewhat, but it remains ambiguous for agents that must retrieve, cite, and act on content. An agent has no clean way to request 'everything under section 4.2 and only that' from markdown. Structured JSON with explicit section_id and parent_id fields makes the hierarchy queryable as data, enabling precise retrieval and verifiable citations.

When should I use a structured JSON pipeline instead of raw PDF upload?

Use a structured JSON pipeline whenever an agent needs to know which section a fact came from. This includes supplier catalogs with categories and products, contracts with numbered clauses, compliance manuals with nested policies, and technical spec sheets. For any document-heavy workflow requiring retrieval, citation, or action on specific sections, preserving hierarchy as data outperforms sending flattened prose.


Want more like this?

I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.

Subscribe to bizflowai.io on YouTube — never miss a new tutorial.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
