Markdown Broke My Invoice Tables — HTML Got Claude to 97%

Stop converting your invoice PDFs to Markdown before sending them to Claude. On tabular documents that advice silently destroys your accuracy, and most tutorials never test it on the messy reality of real supplier invoices. I run these extraction pipelines for clients every week, and here's the part nobody mentions.
The failure mode nobody shows you
Drag a Serbian supplier invoice straight into Claude. Ask for the line items as JSON. Claude reads it, looks confident, returns a clean structured response. Then you diff it against the real invoice and the wheels come off.
On the batch I tested — 80 real supplier invoices from clients running Fakturko — native PDF upload landed at 71% line-item accuracy. Specifically, what went wrong:
- Unit price and VAT rate got concatenated into a single string ("1250.00 20%" instead of two fields).
- Quantity went missing on rows where the product description wrapped to two lines.
- On dense invoices with 15+ line items, Claude occasionally invented a row that didn't exist or skipped one that did.
71% sounds tolerable until you do the math. If that 29% error rate carries over to whole invoices, 200 invoices a week means roughly 58 needing a manual fix, and probably more, because a single wrong line is enough to send an invoice back to a human. Every wrong line is a person opening the PDF, finding the row, correcting the JSON, re-validating. You haven't automated anything; you've rebuilt the original problem with extra steps and an API bill on top.
This is the failure mode tutorials skip. They test on clean single-column receipts and call it a day. Real B2B invoices have multi-column tables, wrapping cells, merged headers, footnotes, and VAT breakdowns that span the bottom third of the page.
Why Markdown conversion isn't the fix
The standard advice, repeated in every "LLM document processing" post, is: convert the PDF to Markdown first, then send the text to Claude. Tools like pdfplumber, marker, or docling will happily do this for you.
I tried it. Accuracy jumped to 84%. Better, but still not production-grade. Here's why Markdown tables collapse on real invoices:
| Item | Qty | Unit Price | VAT | Total |
|-------------------------------|-----|------------|-----|---------|
| Industrial bearing 6204-2RS | 10 | 1250.00 | 20% | 15000.00|
| Hydraulic hose assembly 1/2" | | | | |
| with crimped fittings, 2m | 4 | 3400.00 | 20% | 16320.00|
| Filter cartridge replacement | 25 | 480.00 | 20% | 14400.00|
See the problem? The moment a product description wraps to two lines, one line item turns into two table rows. The "Hydraulic hose assembly" row carries the start of the description and four empty cells, while its quantity, price, VAT, and total sit on the continuation row below. Claude has to guess whether that is one item or two, and which description the numbers belong to. On dense invoices with long Serbian product descriptions full of dimensions and part numbers, it guesses wrong.
A Markdown table row has to fit on a single line of text, so the format only holds if every cell is a single line. Real invoices violate that constantly.
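To make the ambiguity concrete, here is a minimal sketch that reads the table above the naive way, one record per line. The parse_markdown_table helper is a hypothetical illustration, not part of any library, and it is deliberately simplistic.

```python
MARKDOWN_TABLE = """\
| Item | Qty | Unit Price | VAT | Total |
|------|-----|------------|-----|-------|
| Industrial bearing 6204-2RS | 10 | 1250.00 | 20% | 15000.00 |
| Hydraulic hose assembly 1/2" | | | | |
| with crimped fittings, 2m | 4 | 3400.00 | 20% | 16320.00 |
| Filter cartridge replacement | 25 | 480.00 | 20% | 14400.00 |
"""


def parse_markdown_table(text: str) -> list[dict]:
    """Naive parser (illustrative): every table line becomes one record."""
    lines = [line.strip() for line in text.strip().splitlines()]
    header = [h.strip() for h in lines[0].strip("|").split("|")]
    records = []
    for line in lines[2:]:  # skip the header row and the |---| separator
        cells = [c.strip() for c in line.strip("|").split("|")]
        records.append(dict(zip(header, cells)))
    return records


records = parse_markdown_table(MARKDOWN_TABLE)
print(len(records))  # 4 records, although the invoice has 3 line items
for record in records:
    if not record["Qty"]:
        # A description with no numbers attached: half of a wrapped row.
        print("orphan row:", record["Item"])
```

Nothing in the text says the orphan row belongs to the row below it rather than the row above. A reader, human or model, has to infer that.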
The HTML table trick
Here's the fix. Instead of Markdown, convert the tables to HTML. Claude was trained on enormous amounts of HTML, and <table>, <tr>, <td> tags give it explicit, unambiguous structure that survives any line wrapping because the structure is in the tags, not in whitespace.
The script is about 30 lines:

```python
import html

import pdfplumber
from anthropic import Anthropic


def pdf_to_html_tables(pdf_path: str) -> str:
    """Flatten every table pdfplumber finds into a single HTML <table>."""
    html_parts = ["<table>"]
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                for row in table:
                    html_parts.append("<tr>")
                    for cell in row:
                        # Empty cells come back as None; wrapped cells contain newlines.
                        cell_text = (cell or "").strip().replace("\n", " ")
                        # Escape &, < and > so cell text cannot break the markup.
                        html_parts.append(f"<td>{html.escape(cell_text, quote=False)}</td>")
                    html_parts.append("</tr>")
    html_parts.append("</table>")
    return "".join(html_parts)


client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment
invoice_html = pdf_to_html_tables("invoice_001.pdf")
response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=2048,
    system="You are an invoice parser. The input is an HTML table. "
    "Return JSON with fields: description, quantity, unit_price, "
    "vat_rate, line_total. Never invent values.",
    messages=[{"role": "user", "content": invoice_html}],
)
print(response.content[0].text)
```
That's the whole thing. No fancy prompt engineering, no chain-of-thought, no few-shot examples, no function calling. A three-line system prompt and a clean HTML string in the user message.
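The reply is still a string, so something has to turn it into data before it reaches a diff or an accounting system. A minimal sketch of that step, assuming the reply is either a bare JSON list or an object wrapping one, and that it can arrive wrapped in a Markdown code fence. The parse_line_items helper and the line_items key are hypothetical names chosen for illustration.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_line_items(raw_text: str) -> list[dict]:
    """Turn the model's text reply into a list of line-item dicts."""
    text = raw_text.strip()
    if text.startswith(FENCE):
        # Drop the surrounding fence and an optional "json" language tag.
        text = text.strip("`").removeprefix("json")
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if isinstance(data, dict):
        # Assumed wrapper key; a single object is treated as one line item.
        data = data.get("line_items", [data])
    if not isinstance(data, list):
        raise ValueError("expected a JSON list of line items")
    return data


reply = FENCE + 'json\n[{"description": "Filter cartridge", "quantity": 25}]\n' + FENCE
print(parse_line_items(reply))
```

Letting json.loads raise on malformed output is deliberate: a failed parse is easier to retry or route to a human than a silently half-parsed invoice.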
What this gives you
- Explicit cell boundaries: <td> tags are unambiguous. A wrapped description is still one <td>, not three half-rows.
- Row integrity: every <tr> is one logical line, regardless of how it rendered on the page.
- No alignment dependency: whitespace and column widths don't matter anymore.
On the same 80 invoices, this hit 97% line-item accuracy. The 3% that failed were genuinely ambiguous — a smudged scan, a hand-corrected price, a row where even a human accountant would call the supplier to confirm.
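Here is a small sketch of the row-integrity point that runs without a PDF. The sample rows are hand-written and assume the extracted shape is a list of rows, with a wrapped description arriving as one cell containing a newline and an empty cell arriving as None. The rows_to_html helper is a hypothetical name for illustration.

```python
import html
from typing import Optional


def rows_to_html(rows: list[list[Optional[str]]]) -> str:
    """Render extracted rows as one HTML table, one <tr> per logical row."""
    parts = ["<table>"]
    for row in rows:
        parts.append("<tr>")
        for cell in row:
            text = (cell or "").strip().replace("\n", " ")
            parts.append(f"<td>{html.escape(text, quote=False)}</td>")
        parts.append("</tr>")
    parts.append("</table>")
    return "".join(parts)


# Hand-written sample data (an assumption about what extraction returns).
sample_rows = [
    ["Industrial bearing 6204-2RS", "10", "1250.00", "20%", "15000.00"],
    ['Hydraulic hose assembly 1/2"\nwith crimped fittings, 2m', "4", "3400.00", "20%", "16320.00"],
    ["Filter cartridge replacement", None, "480.00", "20%", "14400.00"],
]

table_html = rows_to_html(sample_rows)
print(table_html.count("<tr>"))  # 3 rows for 3 line items
print(table_html)
```

The wrapped description ends up inside a single <td>, and the missing quantity becomes an empty <td></td> that still holds its column position.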
The numbers side: tokens drop 4x
Here's where it gets interesting for anyone running this at volume.
| Approach | Avg tokens / invoice | Accuracy | Cost / invoice (Sonnet) | Monthly cost at 200 invoices / week |
|---|---|---|---|---|
| Native PDF upload | ~14,000 | 71% | ~$0.040 | ~$32/month |
| Markdown text | ~4,800 | 84% | ~$0.014 | ~$11/month |
| HTML tables | ~3,200 | 97% | ~$0.009 | ~$7/month |
Native PDF upload burns tokens because Claude has to process embedded image data — every page is essentially a rendered image plus text. The HTML version is lean structured text only. You drop from ~14k tokens to ~3.2k, accuracy climbs 26 percentage points, and the bill drops from $32 to $7 per month at 200 invoices a week.
This is the rare case where the cheaper path is also the more accurate one. Usually there's a tradeoff. Here there isn't.
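The cost columns can be roughly reproduced from the token counts. This sketch assumes an input price of $3 per million tokens and a four-week month, and it ignores output tokens, so treat the results as an approximation of the table rather than an invoice from your provider.

```python
INPUT_PRICE_PER_MTOK = 3.00  # assumed USD per million input tokens
INVOICES_PER_WEEK = 200
WEEKS_PER_MONTH = 4          # assumed; a calendar month is slightly longer

AVG_TOKENS = {"native_pdf": 14_000, "markdown": 4_800, "html": 3_200}


def cost_per_invoice(tokens: int) -> float:
    """Input-token cost of one invoice in USD."""
    return tokens / 1_000_000 * INPUT_PRICE_PER_MTOK


def monthly_cost(tokens: int) -> float:
    """Input-token cost of a month of invoices in USD."""
    return cost_per_invoice(tokens) * INVOICES_PER_WEEK * WEEKS_PER_MONTH


for name, tokens in AVG_TOKENS.items():
    print(f"{name}: ${cost_per_invoice(tokens):.4f}/invoice, "
          f"${monthly_cost(tokens):.2f}/month")

print(f"token ratio: {AVG_TOKENS['native_pdf'] / AVG_TOKENS['html']:.1f}x")
```

Under these assumptions native PDF comes out near $0.042 per invoice and $33.60 per month, and HTML near $0.0096 and $7.68, which is in line with the rounded figures in the table.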
When this approach breaks and what to do
HTML tables aren't a silver bullet. The approach has known limits I've hit on client work:
- Scanned PDFs (image-only): pdfplumber.extract_tables() returns nothing because there's no text layer. You need OCR first. I use Tesseract for Cyrillic + Latin Serbian invoices, then feed the OCR output into the same HTML pipeline.
- Invoices where the line-item table isn't a real table: some suppliers format with tabs and spaces, no actual table structure. extract_tables() returns empty. Fallback: extract raw text, then ask Claude to reconstruct the HTML table itself before the parsing step. Two API calls, but still cheaper than native PDF.
- Merged header cells: pdfplumber sometimes flattens these incorrectly. If your headers are off, add a tiny post-processing step that detects header rows by font (pdfplumber exposes character-level metadata such as the font name, which usually contains "Bold" for bold faces) and wraps them in <th> instead of <td>.
- Multi-page tables: concatenate all pages into one <table>. Don't open and close the table per page or Claude will treat them as separate items and sometimes duplicate headers as data rows.
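For the multi-page case, one way to keep repeated headers out of the data is to drop them while concatenating. A minimal sketch, assuming the first row of the first table is the header and that continuation pages repeat it verbatim. The merge_page_tables helper is a hypothetical name, and real invoices may need a fuzzier comparison.

```python
from typing import Optional

Row = list[Optional[str]]


def merge_page_tables(page_tables: list[list[Row]]) -> list[Row]:
    """Concatenate per-page tables, keeping the header row only once."""
    merged: list[Row] = []
    header: Optional[Row] = None
    for table in page_tables:
        for row in table:
            if header is None:
                header = row  # first row seen is treated as the header
                merged.append(row)
            elif row == header:
                continue  # repeated header on a continuation page
            else:
                merged.append(row)
    return merged


page_1 = [["Item", "Qty", "Total"], ["Industrial bearing", "10", "15000.00"]]
page_2 = [["Item", "Qty", "Total"], ["Filter cartridge", "25", "14400.00"]]
print(merge_page_tables([page_1, page_2]))
```

The merged rows can then go through the same row-to-HTML loop, so the model sees one table with one header.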
A practical heuristic I use: if extract_tables() returns at least one table with 3+ columns and 2+ rows, go HTML. If it returns nothing or garbage, fall back to OCR-then-HTML or raw-text-then-reconstruct.
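That routing rule is small enough to write down. A sketch under stated assumptions: tables is the list of extracted tables for a document, page_text is whatever raw text was extracted, and the three strategy labels are hypothetical names for the paths described above.

```python
def choose_strategy(tables: list[list[list]], page_text: str) -> str:
    """Pick an extraction path from what table extraction returned."""
    for table in tables:
        n_rows = len(table)
        n_cols = max((len(row) for row in table), default=0)
        if n_cols >= 3 and n_rows >= 2:
            return "html"  # a usable table: go straight to HTML
    if not page_text.strip():
        return "ocr_then_html"  # no text layer at all, so likely a scan
    return "raw_text_then_reconstruct"  # text exists but no real table


usable = [[["Item", "Qty", "Total"], ["Bearing", "10", "15000.00"]]]
print(choose_strategy(usable, "Item Qty Total ..."))  # html
print(choose_strategy([], ""))                        # ocr_then_html
print(choose_strategy([], "Bearing    10    15000"))  # raw_text_then_reconstruct
```

The size check catches empty results and tiny fragments. It does not catch a table with the right shape and wrong contents, which is what the validation harness is for.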
A minimal validation harness
Don't trust accuracy numbers from a single eyeballed sample. Build a diff harness once, reuse it forever. Mine looks like this:
```python
def diff_extraction(predicted: list[dict], truth: list[dict]) -> dict:
    """Return per-field accuracy over the line items that can be paired up."""
    fields = ["description", "quantity", "unit_price", "vat_rate", "line_total"]
    correct = {f: 0 for f in fields}
    # zip() pairs rows by position and stops at the shorter list.
    total = min(len(predicted), len(truth))
    for p, t in zip(predicted, truth):
        for f in fields:
            if str(p.get(f, "")).strip() == str(t.get(f, "")).strip():
                correct[f] += 1
    # Avoid dividing by zero when either side has no rows.
    denom = total or 1
    return {
        "row_count_match": len(predicted) == len(truth),
        "field_accuracy": {f: correct[f] / denom for f in fields},
        "line_item_accuracy": sum(correct.values()) / (len(fields) * denom),
    }
```
Label 20-30 invoices by hand once. Run every prompt change, every model version, every extraction tweak through the harness. This is how I knew Markdown was 84% and HTML was 97% — not vibes, actual diffs against ground truth.
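One caveat with exact string comparison: a predicted 1250.0 and a labeled "1250.00" count as a miss even though they are the same amount. A possible refinement is to compare numbers as numbers and text case-insensitively. This is a sketch, not the harness used for the figures above. The values_match helper is hypothetical, and it deliberately treats "20%" and "20" as equal.

```python
from decimal import Decimal, InvalidOperation


def _canonical(value):
    """Return a Decimal for numeric-looking values, else normalized text."""
    text = str(value if value is not None else "").strip()
    number = text.rstrip("%").strip()  # "20 %" and "20%" both become "20"
    try:
        return Decimal(number)
    except InvalidOperation:
        # Not a number: collapse whitespace and ignore case.
        return " ".join(text.split()).casefold()


def values_match(predicted, truth) -> bool:
    """Compare two field values after normalization."""
    return _canonical(predicted) == _canonical(truth)


print(values_match(1250.0, "1250.00"))  # True
print(values_match("20 %", "20%"))      # True
print(values_match("4", "40"))          # False
```

Swapping this in for the string comparison keeps formatting differences from being counted as extraction errors. It does not handle locale-specific formats such as "1.250,00", which would need their own rule.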
Why bizflowai.io helps with this
This is exactly the kind of extraction pipeline we ship for clients who are drowning in supplier PDFs, contracts, delivery notes, or any document where structure matters more than prose. At bizflowai.io we build the boring infrastructure around it — the OCR fallback, the validation harness, the queue that retries failed extractions, the webhook that pushes clean JSON into accounting software like Fakturko or whatever stack the client runs. The HTML table trick is one piece. The pipeline around it is what turns 97% accuracy into a system that actually replaces manual data entry.
Frequently asked questions
Why does native PDF upload to Claude fail for invoice extraction?
When you drag a PDF invoice directly into Claude and request structured JSON, accuracy averages around 71 percent on line items. Columns get merged (unit price concatenated with VAT rate), and quantities disappear on multi-line rows. Claude must process embedded image data, which also burns roughly 14,000 tokens per invoice. At 200 invoices a week, the error rate forces enough manual fixes to defeat the automation.
Why is HTML better than Markdown for sending tables to Claude?
Markdown tables use pipes and dashes for alignment, which collapse the moment a cell wraps to two lines, forcing Claude to guess column assignments. HTML uses explicit table, tr, and td tags that survive line wrapping. Claude was trained on enormous amounts of HTML, so it reads the structure reliably. Switching from Markdown to HTML raised line-item accuracy from 84 percent to 97 percent on an 80-invoice test.
How do I extract invoice line items from PDFs with high accuracy?
Use pdfplumber to open the PDF and call extract_tables on each page. Loop through rows, wrap each cell in td tags, each row in tr tags, and concatenate inside a table tag. Send that HTML string to Claude as plain text with a short system prompt defining the parser role, expected JSON fields, and a rule against inventing values. The script is about 28 lines of Python.
How much does HTML table extraction reduce Claude token costs?
Native PDF upload consumes roughly 14,000 tokens per invoice because Claude processes embedded image data. Converting tables to HTML and sending lean structured text drops usage to about 3,200 tokens per invoice. On Claude Sonnet, that moves cost from around 4 cents to under 1 cent per invoice. At 200 invoices a week, monthly spend drops from 32 dollars to 7 dollars, with higher accuracy.
When should I use native PDF upload vs HTML table extraction with Claude?
Use native PDF upload for prose-heavy documents where layout fidelity matters and you need Claude to see formatting, signatures, or images. Use HTML table extraction when structure matters more than prose, such as supplier invoices, contracts with tabular data, or any document where rows and columns must map cleanly to JSON fields. HTML extraction delivers higher accuracy and roughly one-quarter the token cost.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.