Claude Glues PDF Columns Together — 20 Lines of Python Fix

Drop a two-column vendor invoice straight into Claude and ask for JSON. On clean digital PDFs it works. On real vendor invoices from real clients, 38% come back with the description from the left column glued to the price from the right column. Your "automation" is now a manual cleanup job — and the fix isn't a better model, it's a 20-line preprocessor.
The bug nobody documents
Claude's native PDF reader linearizes the page top-to-bottom across the full page width. It doesn't model columns. So when a vendor invoice has a product description on the left and a price block on the right at the same y-coordinate, Claude reads them as one continuous line.
I measured this on a single client batch — 240 real Serbian vendor PDFs, mix of two-column and three-column layouts:
| Method | Column-merge errors | Avg input tokens / doc |
|---|---|---|
| Native PDF upload to Claude | 38% | ~14,000 |
| PDF → markdown → Claude | 22% | ~6,500 |
| pdfplumber tokens + bbox → Claude | 4% | ~5,000 |
Two things to notice. First, the Files API doesn't fix it — under the hood it still linearizes, same error rate. Second, the markdown-conversion advice you see everywhere only gets you halfway. It strips the spatial information that would have told Claude there are two columns in the first place.
The model isn't dumb. It's reading exactly what you sent. You sent a flattened page.
Why "send the PDF" is the wrong abstraction
When you upload a PDF to an LLM, you're trusting whatever rasterization + reading-order heuristic the vendor wired up. That heuristic was tuned on academic papers, contracts, and synthetic test sets — single-column or clean two-column reflowable text. Real invoices look nothing like that:
- Header row that spans the full page width, then splits into 2 or 3 columns
- Line-item table where descriptions wrap to 2-3 lines but quantities don't
- Footer totals that float right-aligned, often at the same y as a line item
- Logos and stamps that occupy the top-right corner
The model never sees coordinates. It sees a stream of tokens in some order the parser picked, and it has no way to ask "wait, was that price actually next to that description, or three inches to the right?"
The fix is to stop hiding the geometry. Extract the words yourself with their bounding boxes and let Claude do the column reasoning explicitly.
The 20-line preprocessor
import pdfplumber
import json
def extract_tokens(pdf_path):
pages = []
with pdfplumber.open(pdf_path) as pdf:
for page_num, page in enumerate(pdf.pages, start=1):
words = page.extract_words(
keep_blank_chars=False,
use_text_flow=False,
)
tokens = [
{
"text": w["text"],
"x0": round(w["x0"], 1),
"y0": round(w["top"], 1),
"x1": round(w["x1"], 1),
"y1": round(w["bottom"], 1),
}
for w in words
]
pages.append({"page": page_num, "tokens": tokens})
return pages
if __name__ == "__main__":
print(json.dumps(extract_tokens("invoice.pdf"), ensure_ascii=False))
That's the whole thing. No layout detection, no OCR tuning, no model swap. Every token comes out with text plus four numbers — x0, y0, x1, y1 — the bounding box in PDF points.
A single token in the output looks like:
{"text": "Toner", "x0": 52.3, "y0": 218.7, "x1": 78.1, "y1": 229.4}
{"text": "HP", "x0": 81.0, "y0": 218.7, "x1": 92.4, "y1": 229.4}
{"text": "12A", "x0": 95.2, "y0": 218.7, "x1": 110.8, "y1": 229.4}
{"text": "2", "x0": 322.1, "y0": 218.7, "x1": 328.5, "y1": 229.4}
{"text": "84,00", "x0": 410.4, "y0": 218.7, "x1": 432.9, "y1": 229.4}
{"text": "168,00", "x0": 498.2, "y0": 218.7, "x1": 525.0, "y1": 229.4}
Same y-coordinate, four distinct x-clusters. The geometry is now legible to anything that can read JSON.
The prompt that does the column reasoning
Send the token JSON to Claude with a prompt that names the structure explicitly. The phrase that does the work is "column by column":
You receive a JSON array of tokens extracted from one page of a vendor
invoice. Each token has a text value and a bounding box (x0, y0, x1, y1)
in PDF points. Origin is top-left.
Task:
1. Identify column boundaries by clustering tokens on x0.
2. Reconstruct reading order column by column — group all tokens in the
leftmost column sorted by y0, then the next column, etc.
3. Within the line-item region, return structured JSON:
[{ "description": str, "quantity": number, "unit_price": number,
"total": number }]
4. If a row's totals do not match quantity * unit_price within 0.02,
include "warning": "math_mismatch".
Tokens:
<paste JSON here>
What Claude does internally: it scans the x0 values, sees a cluster around 50 (descriptions), one around 320 (quantities), one around 410 (unit prices), one around 498 (totals). It groups by column first, sorts by y inside each column, then zips them back together as rows. The cross-column merging vanishes.
What changed for the model
- It no longer has to guess where columns are — the x-coordinates say it explicitly
- It can detect column count per page (some invoices switch between 2 and 3 columns mid-document)
- It can flag rows where the math doesn't reconcile, which catches OCR errors and missed tokens before they hit the accounting system
The token-cost side effect nobody mentions
Native PDF upload was burning ~14,000 input tokens per document because Claude rasterizes pages internally and feeds them as images. The token JSON stream averages 5,000 tokens for the same content.
On a single client running 800 invoices per month:
| Pipeline | Tokens/doc | Monthly tokens | Approx cost (Sonnet input pricing) |
|---|---|---|---|
| Native upload | 14,000 | 11.2M | ~$33.6 |
| Token JSON | 5,000 | 4.0M | ~$12.0 |
That's a 64% cost cut on every invoice, and the error rate dropped from 38% to 4% — roughly nine times more accurate. The cheaper path is also the better path. That's rare enough to underline.
Where it still breaks, and what to do
This preprocessor isn't magic. Three failure modes I've hit in production:
- Merged header cells. A header row like "Item / Qty / Price / Total" that spans columns visually but is rendered as one wide token will confuse the row-mapping step. Fix: a two-pass prompt — first call identifies the header row and column labels with their x-ranges, second call maps line items into those ranges.
- Scanned image PDFs. No text layer means pdfplumber returns an empty list. Run Tesseract first with
--psm 6and thetsvoutput format — you get the sametext + bboxstructure, so the downstream prompt is identical. The pipeline stays the same; only the extractor swaps. - Very dense pages. Pages with 80+ line items push the token JSON above 8k. Chunk by page, run page-level extractions in parallel, then concatenate the structured outputs. Don't try to stuff the whole document into one request.
Quick decision rules
- Digital PDF with text layer → pdfplumber
extract_words - Scanned PDF or image → Tesseract with bbox output, same downstream prompt
- Multi-page invoice → one request per page, parallel, merge results
- Tables with merged headers → two-pass: header detection, then row mapping
Why this generalizes beyond invoices
The pattern — extract geometry, send geometry, let the model reason about geometry — works anywhere LLMs are silently flattening 2D structure into 1D:
- Bank statements with debit/credit columns
- Contracts with margin annotations or two-column legal layouts
- Lab reports with reference ranges in a parallel column
- Shipping manifests with origin/destination side-by-side
Anywhere a human would say "obviously the price goes with that product, look at where it's printed on the page," the model needs the coordinates spelled out. The model didn't get smarter. We just stopped lying to it about where the words are on the page.
Why bizflowai.io helps with this
This token-plus-bbox preprocessor is the pattern we ship inside client invoice automations at bizflowai.io. It runs daily on Serbian vendor PDFs with two- and three-column layouts, the error rate stays under 5%, and the whole preprocessor is under 40 lines of Python wrapped around a structured Claude call. When a client says "automate our invoice intake," this is the layer between the email attachment and the accounting system — boring, deterministic, cheap, and the reason the AI part actually works.
Frequently asked questions
Why does Claude merge columns when extracting data from multi-column PDF invoices?
Claude doesn't know where the columns are on the page. It reads tokens left-to-right, top-to-bottom across the entire page as if it were one paragraph, which glues descriptions from the left column to prices from the right column. The Files API has the same issue because it linearizes the PDF internally. The fix is giving Claude spatial bounding box information for each word.
How do I extract line items from multi-column PDF invoices accurately?
Use pdfplumber to open the PDF and call extract_words on each page. This returns each token with its text and bounding box coordinates (x0, y0, x1, y1). Dump that list as JSON and send it to Claude with a prompt instructing it to reconstruct reading order column by column, then return structured JSON. This approach reduces extraction errors from 38% to around 4%.
When should I use bounding box JSON vs native PDF upload with Claude?
Use bounding box JSON for any multi-column document like vendor invoices, where spatial layout matters. On 240 real vendor PDFs, native upload produced 38% column-merge errors versus 4% with bbox JSON. Native upload also burned ~14,000 input tokens per document versus ~5,000 for JSON, a 64% cost reduction. Native upload only works reliably on clean, single-column digital documents.
How do I handle scanned image PDFs that have no text layer?
pdfplumber won't work on scanned PDFs because they lack a text layer. Run OCR first using Tesseract with bounding box output enabled. Tesseract produces the same text-plus-coordinates JSON structure as pdfplumber, so the second half of the pipeline — sending tokens to Claude with a column-by-column prompt — remains identical. Only the extraction step changes.
Why does markdown conversion still produce errors on complex PDFs?
Markdown conversion reduces column-merge errors from 38% to 22%, but it strips out the spatial information the model needs to distinguish columns. Markdown is a linear format, so once coordinates are discarded, Claude still has to guess at the original layout. Preserving the raw x0/y0/x1/y1 bounding boxes as JSON tokens is what drops the error rate to around 4%.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.
Frequently asked questions
Why does Claude merge columns when extracting data from multi-column PDF invoices?
Claude doesn't know where the columns are on the page. It reads tokens left-to-right, top-to-bottom across the entire page as if it were one paragraph, which glues descriptions from the left column to prices from the right column. The Files API has the same issue because it linearizes the PDF internally. The fix is giving Claude spatial bounding box information for each word.
How do I extract line items from multi-column PDF invoices accurately?
Use pdfplumber to open the PDF and call extract_words on each page. This returns each token with its text and bounding box coordinates (x0, y0, x1, y1). Dump that list as JSON and send it to Claude with a prompt instructing it to reconstruct reading order column by column, then return structured JSON. This approach reduces extraction errors from 38% to around 4%.
When should I use bounding box JSON vs native PDF upload with Claude?
Use bounding box JSON for any multi-column document like vendor invoices, where spatial layout matters. On 240 real vendor PDFs, native upload produced 38% column-merge errors versus 4% with bbox JSON. Native upload also burned ~14,000 input tokens per document versus ~5,000 for JSON, a 64% cost reduction. Native upload only works reliably on clean, single-column digital documents.
How do I handle scanned image PDFs that have no text layer?
pdfplumber won't work on scanned PDFs because they lack a text layer. Run OCR first using Tesseract with bounding box output enabled. Tesseract produces the same text-plus-coordinates JSON structure as pdfplumber, so the second half of the pipeline — sending tokens to Claude with a column-by-column prompt — remains identical. Only the extraction step changes.
Why does markdown conversion still produce errors on complex PDFs?
Markdown conversion reduces column-merge errors from 38% to 22%, but it strips out the spatial information the model needs to distinguish columns. Markdown is a linear format, so once coordinates are discarded, Claude still has to guess at the original layout. Preserving the raw x0/y0/x1/y1 bounding boxes as JSON tokens is what drops the error rate to around 4%.