Claude Flattens PDF Tables Into Mush — Use HTML Not Markdown


If you're piping supplier PDFs into Claude and it keeps attaching the wrong amount to the wrong date, the problem isn't the model. It's the format you're feeding it. I ran 94 line items from a 3-column Serbian bank statement through three pipelines this week — native upload, markdown, and HTML — and the gap between them is bigger than any prompt-engineering trick I've tried.

Here are the real numbers and the four-line preprocessing fix.

The scenario that breaks every "just upload the PDF" tutorial

A small accounting firm I work with receives roughly 200 supplier PDFs a week. Bank statements, utility bills, freelancer invoices. Mixed Serbian and English. Some scanned, some digital, all with the kind of table layouts that PDF tutorials politely ignore: merged header cells, sub-total rows, running balance columns, two-line descriptions.

The goal is boring and concrete: read each PDF, extract every line item, push it into their bookkeeping system as JSON. The failure mode is not boring. It looks like this:

  • Amount from row 14 gets attached to the date from row 13.
  • A two-line description ("Payment to supplier — invoice 2024/118") collapses into one mangled string.
  • The running balance column bleeds into the amount column. Every transaction is off by exactly the previous balance.

The output looks confident. That's the dangerous part. Claude doesn't say "I'm not sure." It commits.

What actually goes wrong with native PDF upload

I took one real statement (date, description, and amount columns under merged header cells, a running balance on the right, 94 transactions) and dropped it straight into Claude via the native PDF upload. The model does read the page, but judging by the output, the table is effectively flattened into prose before any reasoning happens. Once the grid is gone, column alignment is gone.

Diffed against ground truth: 23 of 94 line items wrong. That's 75.5% accuracy (71 of 94) on a document the human eye reads correctly in 30 seconds.

The breakdown of errors was telling:

  • 11 amount-to-date misalignments (column drift)
  • 7 merged descriptions (multi-line rows collapsed)
  • 5 running-balance bleed-throughs (wrong column picked)

You cannot prompt your way out of this. The structure was destroyed before Claude ever started reasoning.
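For anyone who wants to reproduce the diff, here is a minimal sketch of the scoring step. It assumes both the ground truth and the extraction are lists of dicts in the same row order; the `score_extraction` name and the row shape are illustrative, not part of the original pipeline.

```python
from typing import Dict, List

# Hypothetical row shape: {"date": ..., "description": ..., "amount": ...}
LineItem = Dict[str, str]


def score_extraction(truth: List[LineItem], extracted: List[LineItem]) -> dict:
    """Compare extracted rows to ground truth, position by position.

    A row counts as wrong if any field differs or if the row is missing.
    Extra rows beyond the length of the ground truth are ignored here.
    """
    wrong_rows = []
    for i, expected in enumerate(truth):
        got = extracted[i] if i < len(extracted) else None
        if got != expected:
            wrong_rows.append(i)

    total = len(truth)
    correct = total - len(wrong_rows)
    return {
        "total": total,
        "wrong": len(wrong_rows),
        "accuracy_pct": round(100 * correct / total, 1) if total else 0.0,
        "wrong_rows": wrong_rows,
    }
```

With 94 ground-truth rows and 23 mismatches, `accuracy_pct` comes out at 75.5, matching the figure above. Position-by-position comparison is the simplest choice; a pipeline that can drop or duplicate rows would need key-based matching instead.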

Markdown helps, but markdown tables cannot represent merged cells

The standard advice is: convert the PDF to markdown first, then send the markdown as text. Fine. I ran the same statement through Docling, exported to markdown, sent it as a text block.

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("statement.pdf")
md = result.document.export_to_markdown()

Result: 78 of 94 correct. Better, but still broken. The reason is structural, not statistical. Markdown tables look like this:

| Date       | Description     | Amount  |
|------------|-----------------|---------|
| 2024-03-12 | Wire transfer   | -120.00 |

The moment you have a header that spans two columns, a sub-total row, or a cell with a line break inside it, markdown collapses. Pipes and dashes can't encode colspan, rowspan, or <br>. The information is silently lost during the export step. Claude isn't failing — it's reading exactly what you sent it, which no longer matches reality.

Where markdown loses fidelity

  • Merged header cells (e.g. "Transactions" spanning Description + Reference)
  • Sub-total / running-balance rows
  • Multi-line cells (long descriptions wrap into the next row)
  • Right-aligned numeric columns with parentheses for negatives
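A quick way to tell whether a given table will suffer in markdown is to scan its HTML for the features pipes and dashes cannot carry. The sketch below uses only the standard library; the class and function names are hypothetical, and it assumes the HTML has explicit closing tags on cells.

```python
from html.parser import HTMLParser


class TableFeatureScanner(HTMLParser):
    """Counts table features that a markdown table cannot represent."""

    def __init__(self):
        super().__init__()
        self.cell_depth = 0  # > 0 while inside a <td> or <th>
        self.found = {"colspan": 0, "rowspan": 0, "br_in_cell": 0}

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("td", "th"):
            self.cell_depth += 1
            for name in ("colspan", "rowspan"):
                value = attr_map.get(name) or "1"
                if value.isdigit() and int(value) > 1:
                    self.found[name] += 1
        elif tag == "br" and self.cell_depth > 0:
            self.found["br_in_cell"] += 1

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell_depth > 0:
            self.cell_depth -= 1


def markdown_lossy_features(html: str) -> dict:
    """Return counts of merged cells and in-cell line breaks in the HTML."""
    scanner = TableFeatureScanner()
    scanner.feed(html)
    scanner.close()
    return scanner.found


# Illustrative table: a spanning header and a two-line description.
sample = (
    '<table><thead><tr><th>Date</th><th colspan="2">Transactions</th></tr></thead>'
    "<tbody><tr><td>2024-03-12</td>"
    "<td>Payment to supplier<br>invoice 2024/118</td>"
    "<td>-120.00</td></tr></tbody></table>"
)
print(markdown_lossy_features(sample))
```

Any non-zero count means a markdown export of that table has to drop or approximate something.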

The one-line fix: export HTML, not markdown

Same Docling pipeline. Change the export call. Send the raw HTML string as a plain text block in the user message — not as a file attachment, just inline text.

from anthropic import Anthropic
from docling.document_converter import DocumentConverter

# Convert the PDF and export HTML so spans and in-cell line breaks survive.
converter = DocumentConverter()
result = converter.convert("statement.pdf")
html = result.document.export_to_html()

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
msg = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=8000,
    messages=[{
        "role": "user",
        "content": (
            "Here is a bank statement as HTML. Extract every transaction "
            "as a JSON array with keys: date, description, amount. "
            "Ignore running balance and sub-totals.\n\n"
            f"{html}"
        ),
    }],
)

# The reply is a list of content blocks; the first text block carries the answer.
print(msg.content[0].text)

Result on the same 94-line statement: 92 of 94 correct. The two it missed were genuinely ambiguous in the source scan — a smudged digit and a date that was half-cut at a page boundary. A human reviewer flagged them too.
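The reply still has to become data before it can go into a bookkeeping system. Below is a hedged sketch of that step: it pulls the first JSON array out of the reply text and checks that each row has the three keys the prompt asked for. The `parse_transactions` helper is hypothetical, and it assumes the reply contains exactly one top-level array, possibly wrapped in prose or a code fence.

```python
import json
from typing import Any, Dict, List

REQUIRED_KEYS = {"date", "description", "amount"}


def parse_transactions(reply_text: str) -> List[Dict[str, Any]]:
    """Extract and validate the JSON array of transactions from a reply."""
    start = reply_text.find("[")
    end = reply_text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no JSON array found in reply")

    # json.JSONDecodeError is a subclass of ValueError, so callers can
    # catch ValueError for every failure mode in this function.
    rows = json.loads(reply_text[start:end + 1])
    if not isinstance(rows, list):
        raise ValueError("reply JSON is not an array")

    for i, row in enumerate(rows):
        if not isinstance(row, dict):
            raise ValueError(f"row {i} is not an object")
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            raise ValueError(f"row {i} is missing keys: {sorted(missing)}")
    return rows
```

With the Anthropic SDK call above, usage would be `rows = parse_transactions(msg.content[0].text)`. Failing loudly on a missing key is a deliberate choice here: a silent default amount is exactly the kind of confident wrong answer this pipeline is trying to avoid.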

Why HTML wins: Claude was trained on enormous amounts of HTML. <thead>, <tbody>, <tr>, <td colspan="2">, <th rowspan="3"> — it reads these semantics natively. Markdown is a format for humans reading documentation. HTML is a format for machines reading structure. When your input is structured data, give the model structure.

The numbers across 200 documents

I ran the full week's intake (200 documents, mixed types) through all three pipelines with the same prompt and the same model. Here's how they compared:

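| Pipeline           | Accuracy | Cost / doc | Latency / doc |
|--------------------|----------|------------|---------------|
| Native PDF upload  | 76%      | $0.018     | ~11 s         |
| Docling → Markdown | 83%      | $0.006     | ~4 s          |
| Docling → HTML     | 97%      | $0.004     | <3 s          |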

Cost differences come from one thing: vision tokens. Native PDF upload pays for every page as an image. The markdown and HTML pipelines send only text, which means fewer input tokens and faster processing. HTML is actually slightly cheaper than markdown here because Claude one-shots the extraction instead of asking clarifying questions or returning malformed JSON that needs a retry.

For this firm, the math is:

  • 200 docs/week × $0.014 saved per doc × 52 weeks = ~$145/year just in API cost.
  • 76% → 97% accuracy means ~42 fewer documents per week sent to a human for correction. At 4 minutes of review time saved per document, that's ~2.8 hours/week back.
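The arithmetic behind those two bullets, written out so the inputs are easy to swap for another firm's numbers. All constants come from the table and bullets above; like the bullets, the sketch treats accuracy as the share of documents that need no human correction.

```python
DOCS_PER_WEEK = 200
COST_NATIVE_USD = 0.018   # native PDF upload, per document
COST_HTML_USD = 0.004     # Docling -> HTML, per document
ACCURACY_NATIVE = 0.76
ACCURACY_HTML = 0.97
REVIEW_MINUTES_PER_DOC = 4


def savings_summary() -> dict:
    """Yearly API savings and weekly review time recovered."""
    saved_per_doc = COST_NATIVE_USD - COST_HTML_USD
    api_saving_per_year = saved_per_doc * DOCS_PER_WEEK * 52

    fewer_reviews_per_week = (ACCURACY_HTML - ACCURACY_NATIVE) * DOCS_PER_WEEK
    hours_back_per_week = fewer_reviews_per_week * REVIEW_MINUTES_PER_DOC / 60

    return {
        "api_saving_per_year_usd": round(api_saving_per_year, 2),
        "fewer_reviews_per_week": round(fewer_reviews_per_week),
        "hours_back_per_week": round(hours_back_per_week, 1),
    }


print(savings_summary())
```

The review-time figure dwarfs the API saving, which is the real argument for the switch.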

Same model. Same prompt. One preprocessing change.

Production gotchas I hit running this daily

This is the stuff that bites you on document #137, not document #1.

Watch for these in real pipelines

  • Inline CSS in Docling's HTML export. Sometimes Docling emits <td style="text-align:right; padding:4px;">…</td> on every cell. Strip it with a quick BeautifulSoup pass — it burns tokens and contributes nothing to extraction quality.
  • Multi-page statements hit context limits. For statements over ~15 pages, export per page and chunk. Send each page as its own API call, then concatenate the resulting JSON arrays. Don't try to stuff a 40-page statement into one prompt.
  • Scanned PDFs need OCR first. Docling has a built-in OCR pipeline — turn it on in the converter config. Without it, your HTML export will be empty <td></td> cells and you'll wonder why Claude returned an empty array.
  • Always log the raw HTML before you send it. When something goes wrong in production, you want to see exactly what Claude saw. I write every outgoing HTML payload to a timestamped file in ./debug/ with a 7-day rotation. Debugging extraction errors without this is guessing.
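Two of these are easy to sketch with the standard library alone: the debug log with a 7-day rotation, and stitching per-page results back together. Function names, the file naming scheme, and the added `page` field are my own illustrative choices, not Docling or Anthropic APIs.

```python
import time
import uuid
from pathlib import Path
from typing import Dict, List, Optional


def log_payload(html: str, debug_dir: str = "./debug", keep_days: int = 7,
                now: Optional[float] = None) -> Path:
    """Write the outgoing HTML to a timestamped file and prune old files."""
    now = time.time() if now is None else now
    folder = Path(debug_dir)
    folder.mkdir(parents=True, exist_ok=True)

    # Rotation: delete payload files last modified before the cutoff.
    cutoff = now - keep_days * 86400
    for old in folder.glob("payload-*.html"):
        if old.stat().st_mtime < cutoff:
            old.unlink()

    stamp = time.strftime("%Y%m%dT%H%M%S", time.gmtime(now))
    # Short random suffix avoids collisions within the same second.
    path = folder / f"payload-{stamp}-{uuid.uuid4().hex[:8]}.html"
    path.write_text(html, encoding="utf-8")
    return path


def merge_page_results(pages: List[List[Dict]]) -> List[Dict]:
    """Concatenate per-page transaction arrays in page order.

    Each row is tagged with its 1-based page number (an addition for
    traceability; drop it if the target schema is strict).
    """
    merged = []
    for page_number, rows in enumerate(pages, start=1):
        for row in rows:
            merged.append({**row, "page": page_number})
    return merged
```

Call `log_payload(html)` immediately before the API request so the file on disk is byte-for-byte what the model received.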

A small CSS-stripping helper that's saved me a few thousand tokens per document:

from bs4 import BeautifulSoup

def clean_html_for_llm(html: str) -> str:
    """Drop presentational attributes; keep structural ones like colspan."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all(True):  # True matches every tag
        for attr in ("style", "class", "id"):
            tag.attrs.pop(attr, None)
    return str(soup)

Drop this between export_to_html() and the API call. On a typical bank statement it cuts the payload by 30-40%.

Why this generalizes beyond bank statements

Anything with a table — utility bills, supplier invoices, payroll registers, expense reports, inventory lists — has the same failure mode. The format conversion layer is doing 90% of the work, and most pipelines pick the wrong format.

A quick rule of thumb I now follow:

  • Pure prose document (contract, report) → markdown is fine.
  • Anything with a table, header, or grid → HTML.
  • Form-like layout with positional fields (passports, ID cards, structured forms) → still vision, but with explicit field prompts.
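As a routing function, the rule of thumb is only a few lines. This is a sketch with hypothetical names; how a pipeline decides `has_table` or `is_positional_form` is out of scope here, and the precedence when a document is both is my assumption.

```python
def choose_input_format(has_table: bool, is_positional_form: bool) -> str:
    """Pick the format to send to the model, following the rule of thumb."""
    if is_positional_form:
        # Passports, ID cards, structured forms: layout carries the meaning.
        return "vision"
    if has_table:
        # Any table, header, or grid: keep the structure explicit.
        return "html"
    # Pure prose: contracts, reports.
    return "markdown"


print(choose_input_format(has_table=True, is_positional_form=False))
```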

The model doesn't care that HTML is "uglier." It cares that the structure is preserved.

Why bizflowai.io helps with this

I run this exact pipeline daily on client invoicing and bookkeeping systems through bizflowai.io — Docling preprocessing, HTML extraction, Claude for line-item parsing, JSON straight into the client's accounting tool. For the accounting firm above, the same setup handles bank statements, supplier invoices, and utility bills in three languages without manual touch-up. It's not magic, it's just choosing the right preprocessing format and logging everything for when reality breaks your assumptions.

Frequently asked questions

Why does HTML beat Markdown for extracting tables from PDFs with Claude?

HTML preserves table semantics like thead, tbody, colspan, and rowspan, which Markdown cannot represent. Markdown collapses merged cells, spanning headers, and sub-total rows into flat pipes and dashes. Claude was trained on large amounts of HTML and reads table structure natively. In a 200-document test, HTML extraction hit 97% accuracy versus 83% for Markdown and 76% for native PDF upload.

How do I extract table data from PDFs using Docling and Claude?

Install Docling, instantiate a DocumentConverter, call convert on the PDF path, then call export_to_html on the result. Wrap that HTML string inside your Claude prompt as a plain text block in the user message, with an instruction like 'here is a bank statement as HTML, extract every transaction as a JSON array.' That's roughly four lines of Python before the API call.

When should I use native PDF upload versus an HTML preprocessing pipeline with Claude?

Use native PDF upload only for simple, prose-heavy documents without complex tables. For structured documents like bank statements, invoices, or utility bills with merged cells and multiple columns, use an HTML preprocessing pipeline via Docling. In testing, native upload produced 76% accuracy at 1.8 cents per document, while the HTML pipeline reached 97% accuracy at 0.4 cents and dropped latency from eleven seconds to under three.

What are common pitfalls when sending HTML tables to Claude for extraction?

Four issues to watch. Docling's HTML export sometimes includes inline CSS, which wastes tokens and should be stripped. Multi-page statements should be exported and chunked per page to avoid context limits. Scanned PDFs require running OCR inside Docling first using its built-in pipeline. Finally, log the raw HTML before sending it to Claude so you can debug exactly what the model received when extractions fail.

Why does native PDF upload fail on tables with Claude?

With native upload, the table's grid structure is effectively lost before the model reasons over it. On a real Serbian bank statement with three columns and merged headers, native upload produced 23 wrong line items out of 94. Amounts attached to the wrong dates, two-line descriptions merged into one, and the running balance column bled into the amount column. Without preserved table structure, the model cannot reliably align rows and columns.


Want more like this?

I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.

Subscribe to bizflowai.io on YouTube — never miss a new tutorial.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
