Pre-Extract PDFs to CSV: 98% Accuracy on Invoice Tables

By Lazar Milicevic · Published June 19, 2026 · 8 min read

You wired Claude into your invoice intake. The first ten PDFs came back perfect, you shipped it, and two weeks later your bookkeeper found that one in three invoices has rows missing or quantities swapped. The model isn't broken. You're asking it to do vision work it shouldn't be doing.

The Failure Mode Nobody Tells You About

When you attach a PDF to Claude (or GPT, or Gemini — same story), the document gets processed as a visual artifact. For prose-heavy docs — contracts, reports, memos — that's the correct path. The model reads pixels, recovers text, and reasons over it.

Tables are different. A table is structure. Rows align because of vector coordinates inside the PDF, not because pixels happen to land at the same Y coordinate. When the vision pass sees a cell that spans two columns (a merged header, a subtotal row, a "Total" line that bleeds across "Qty" and "Unit Price"), the model has to guess which column it belongs to.

On a clean grid, that guess is right. On a real supplier invoice with merged header cells, multi-line descriptions, and a subtotal row spanning three columns, the guess cascades. One wrong column assignment in row 3 corrupts every row below it.

I ran this on a batch of 1,000 real supplier invoices pulled from client inboxes (Serbian SMBs, mixed languages, mixed templates). Numbers:

Pipeline	Row-level accuracy	Avg latency	Tokens / invoice
Direct PDF → Claude Haiku	67%	14.1s	~9,400
pdfplumber → CSV → Claude Haiku	98%	3.2s	~2,700

67% row accuracy is unusable for accounting. It's not a "tune the prompt" problem — the structure is already lost before the model starts reasoning.

Why Pre-Extraction Wins

Digitally generated PDFs already contain the table structure. When a supplier's accounting software (SAP, QuickBooks, whatever generates their invoices) writes the PDF, it places every character and ruling line at explicit coordinates. The grid is there. Vision throws that away and reconstructs it from pixels.

pdfplumber reads those coordinates directly and rebuilds rows and columns from the positions of lines and words. No pixel guessing. The cells come out aligned because they were aligned in the source file.

The mental model I use with clients:

Probabilistic step: model guesses at structure from an image. Cheap to add, expensive when it's wrong.
Deterministic step: parser reads coordinates from the file. Boring, reliable, fast.

Every time you can move a step from probabilistic to deterministic, your pipeline gets faster, cheaper, and more reliable. This is the entire game with production AI — shrink the surface area the model has to guess at.

The Eight-Line Fix

Install once:

pip install pdfplumber

The minimum viable extractor:

import pdfplumber
import csv
import io

def pdf_to_csv(pdf_path: str) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf)
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                writer.writerows(table)
    return buf.getvalue()

That's it for the happy path. page.extract_tables() returns a list of lists of lists — tables, then rows, then cells — already aligned to the PDF's internal coordinate system.
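To make that shape concrete, here is a small sketch that uses a hand-written table in place of a real extract_tables() result. The sample rows are invented for illustration, and the helper name tables_to_csv is hypothetical. Cells can come back as None (for example around merged cells), and csv.writer emits those as empty fields.

```python
import csv
import io

# Hand-written stand-in for one page's extract_tables() result:
# a list of tables, each a list of rows, each a list of cell values.
sample_tables = [
    [
        ["Description", "Qty", "Unit Price", "Total"],
        ["Premium widget, 10-pack", "3", "12.50", "37.50"],
        ["Subtotal", None, None, "37.50"],  # None marks a cell with no text
    ]
]

def tables_to_csv(tables) -> str:
    """Serialize tables -> rows -> cells into one CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for table in tables:
        writer.writerows(table)  # csv.writer writes None as an empty field
    return buf.getvalue()

csv_text = tables_to_csv(sample_tables)
print(csv_text)
```

Note that the description containing a comma is quoted automatically, so the column count stays stable when the model reads the CSV.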

Now hand the CSV to Claude as plain text instead of attaching the PDF:

from anthropic import Anthropic

client = Anthropic()

csv_text = pdf_to_csv("invoice_2024_0142.pdf")

prompt = f"""Here is an invoice table as CSV. Extract line items as JSON.

Schema per item:
- description (string)
- quantity (number)
- unit_price (number)
- line_total (number)

Ignore subtotal, tax, and total rows. Return only line items.

CSV:
{csv_text}
"""

resp = client.messages.create(
    model="claude-haiku-4-5",
    max_tokens=2000,
    messages=[{"role": "user", "content": prompt}],
)
print(resp.content[0].text)

That's the whole pipeline. The model is now doing what it's actually good at — mapping clean rows to a JSON schema — instead of reverse-engineering a grid from pixels.
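Before the JSON goes anywhere near a ledger, it is worth a cheap deterministic check. The sketch below is one way to do that, not part of the pipeline above: parse_line_items and find_suspect_items are hypothetical helpers, and the sketch assumes the model replies with a JSON array (possibly wrapped in a Markdown code fence) and that line_total should equal quantity × unit_price. Invoices with per-line discounts or taxes would need a looser rule.

```python
import json

REQUIRED_KEYS = {"description", "quantity", "unit_price", "line_total"}
FENCE = "`" * 3  # Markdown code fence marker

def parse_line_items(raw: str) -> list[dict]:
    """Parse the model's reply into a list of line-item dicts."""
    text = raw.strip()
    # Assumption: the reply may be wrapped in a Markdown code fence.
    if text.startswith(FENCE):
        text = text.strip("`")
        if text.startswith("json"):
            text = text[len("json"):]
    items = json.loads(text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of line items")
    return items

def find_suspect_items(items: list[dict], tolerance: float = 0.01) -> list[int]:
    """Return indexes of items with missing keys or inconsistent totals."""
    suspects = []
    for i, item in enumerate(items):
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            suspects.append(i)
            continue
        expected = item["quantity"] * item["unit_price"]
        if abs(expected - item["line_total"]) > tolerance:
            suspects.append(i)
    return suspects

# Invented sample reply: the second item's total does not match 2 x 4.0.
raw = (
    '[{"description": "Widget", "quantity": 3, "unit_price": 12.5, "line_total": 37.5},'
    ' {"description": "Cable", "quantity": 2, "unit_price": 4.0, "line_total": 9.0}]'
)
items = parse_line_items(raw)
suspects = find_suspect_items(items)
print(suspects)
```

Anything flagged here goes to a human. Everything else can flow straight through.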

Handling the Edge Cases You'll Hit in Production

The 40-line version I run for clients adds three things: scan detection, empty-table filtering, and per-page error isolation.

import pdfplumber
import csv
import io
import logging

log = logging.getLogger(__name__)

def pdf_to_csv(pdf_path: str) -> tuple[str, bool]:
    """
    Returns (csv_text, is_scanned).
    If is_scanned is True, fall back to native PDF upload or OCR.
    """
    buf = io.StringIO()
    writer = csv.writer(buf)
    found_any_table = False
    total_chars = 0

    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            try:
                # both calls sit inside the try so one bad page can't abort the file
                total_chars += len(page.extract_text() or "")
                tables = page.extract_tables()
            except Exception as e:
                log.warning("page %d extract failed: %s", i, e)
                continue

            for table in tables:
                # skip empty/garbage tables
                if not table or not any(any(c for c in row) for row in table):
                    continue
                found_any_table = True
                writer.writerows(table)
                writer.writerow([])  # blank line between tables

    # heuristic: a real digital PDF has text. A scan has ~0 chars.
    is_scanned = total_chars < 50 and not found_any_table
    return buf.getvalue(), is_scanned

The scan-detection heuristic is crude but works: if extract_text() returns essentially nothing and no tables came out, it's an image-based PDF. For those, fall back to native PDF upload (vision is the right tool for scans) or run OCR with pytesseract first.

Things that bite you in real batches

Multi-line descriptions: pdfplumber sometimes splits "Premium widget, 10-pack\nblue, plastic" across two rows. Add a post-process pass that merges rows where every column except description is empty.
Currency symbols inside cells: "€ 1.234,56". Strip and normalize before sending — or tell Claude in the prompt that decimals use commas.
Two tables on one page: the writer.writerow([]) separator helps Claude treat them as distinct blocks.
Headerless tables: pdfplumber doesn't know which row is the header. If your suppliers use consistent templates, hardcode the column names in the prompt.
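The first two items in that list can be handled with a short post-processing pass. This is a minimal sketch under stated assumptions: the description sits in column 0, amounts use comma decimals with dot thousands separators, and both helper names are hypothetical. Adjust desc_col and the number format to your suppliers' templates.

```python
import re

def merge_continuation_rows(rows, desc_col=0):
    """Fold rows that only carry a description into the previous row."""
    merged = []
    for row in rows:
        if not row:
            continue  # skip blank separator rows
        cells = [(c or "").strip() for c in row]
        others_empty = all(not c for i, c in enumerate(cells) if i != desc_col)
        if merged and cells[desc_col] and others_empty:
            # continuation line: append its text to the previous description
            merged[-1][desc_col] = f"{merged[-1][desc_col]} {cells[desc_col]}"
        else:
            merged.append(cells)
    return merged

def normalize_amount(cell: str) -> str:
    """Turn '€ 1.234,56' into '1234.56' (comma decimals, dot thousands)."""
    digits = re.sub(r"[^\d,.\-]", "", cell)
    return digits.replace(".", "").replace(",", ".")

# Invented sample rows in the shape extract_tables() produces.
rows = [
    ["Description", "Qty", "Unit Price", "Total"],
    ["Premium widget, 10-pack", "3", "€ 12,50", "€ 37,50"],
    ["blue, plastic", None, "", None],
    ["Cable", "2", "€ 1.234,56", "€ 2.469,12"],
]
merged = merge_continuation_rows(rows)
print(merged[1][0])
print(normalize_amount("€ 1.234,56"))
```

One caveat: a section heading that fills only the description column would also be merged into the row above it, so check a few real invoices before trusting the rule.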

When to Pre-Extract and When to Upload Direct

Not every document benefits from this. The rule I give clients:

If the value lives in the rows → extract first. Invoices, bank statements, supplier price lists, inventory exports, payroll runs, telco call logs, POS daily reports.
If the value lives in the paragraphs → upload direct. Contracts, NDAs, research reports, articles, anything with mixed prose-and-chart layouts.
Mixed documents → extract the tables, upload the PDF, send both. Tell the model which is which.
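For the mixed case, "send both" might look like the sketch below. It builds the content list for a single user message: the PDF as a base64 document block, followed by a text block that labels the CSV. The block shape follows Anthropic's PDF support format as I understand it, so check the current API docs before relying on it. The function name build_mixed_content and the prompt wording are illustrative, not a fixed recipe.

```python
import base64

def build_mixed_content(pdf_bytes: bytes, csv_text: str) -> list[dict]:
    """Content blocks for one user message: the PDF plus its tables as CSV."""
    return [
        {
            "type": "document",
            "source": {
                "type": "base64",
                "media_type": "application/pdf",
                "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
            },
        },
        {
            "type": "text",
            "text": (
                "The attached PDF is the full document. Use it for prose and context.\n"
                "The CSV below holds the same document's tables, extracted from the "
                "PDF's coordinates. Treat the CSV as authoritative for row and "
                "column values.\n\n"
                f"CSV:\n{csv_text}"
            ),
        },
    ]

# Placeholder bytes stand in for a real file read with open(path, "rb").
content = build_mixed_content(b"%PDF-1.4 placeholder", "Description,Qty\r\nWidget,3\r\n")
print(content[1]["text"])
```

Pass the result as messages=[{"role": "user", "content": content}] in the same client.messages.create call shown earlier.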

A practical matrix from the kinds of work I see weekly:

Document type	Right approach	Why
Supplier invoice	pdfplumber → CSV	Pure grid, structure must be exact
Bank statement	pdfplumber → CSV	Reconciliation needs zero row drift
Service contract	Native PDF upload	Meaning lives in clauses, not cells
Product catalog (PDF)	pdfplumber → CSV	Rows are the data
Scanned receipt	OCR → text → Claude	No vector data to extract
Annual report	Native PDF upload	Charts, prose, footnotes — vision wins

The Real Cost Difference at Volume

Take a small accounting firm processing 500 invoices a month. With Claude Haiku 4.5 pricing as of writing (~$1/M input tokens, ~$5/M output):

Direct PDF pipeline

500 invoices × 9,400 input tokens = 4.7M input tokens → ~$4.70
Plus output tokens, plus manual correction on roughly one in three invoices (~165 a month)
Real cost = $4.70 + ~55 hours/month of bookkeeper review

Pre-extraction pipeline

500 invoices × 2,700 input tokens = 1.35M input tokens → ~$1.35
At 98% accuracy, roughly 10 invoices a month need spot-checking
Real cost = $1.35 + ~2 hours/month of review

The token bill drops 71%. The labor bill drops by more than an order of magnitude (roughly 55 hours to 2). That's the difference between an automation that pays for itself in a month and one that quietly bleeds margin while looking impressive in a demo.
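The token arithmetic is easy to rerun with your own volumes. A small sketch, using the approximate input price and per-invoice token counts quoted above (output tokens and labor are left out, and the constant names are my own):

```python
INPUT_PRICE_PER_MTOK = 1.00  # USD, approximate Haiku 4.5 input price quoted above
INVOICES_PER_MONTH = 500

def monthly_input_cost(tokens_per_invoice: int) -> float:
    """Monthly input-token spend in USD for one pipeline."""
    total_tokens = INVOICES_PER_MONTH * tokens_per_invoice
    return total_tokens / 1_000_000 * INPUT_PRICE_PER_MTOK

direct = monthly_input_cost(9_400)     # direct PDF upload
extracted = monthly_input_cost(2_700)  # pdfplumber -> CSV
savings_pct = (direct - extracted) / direct * 100

print(f"direct: ${direct:.2f}, pre-extracted: ${extracted:.2f}, saving: {savings_pct:.0f}%")
# prints: direct: $4.70, pre-extracted: $1.35, saving: 71%
```

Swap in your own invoice count and current pricing. The percentage saving depends only on the ratio of tokens per invoice, so it holds at any volume.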

Why bizflowai.io helps with this

Invoice intake is one of the most common pipelines I build at bizflowai.io — supplier PDFs land in a shared inbox, get parsed with the pdfplumber + Claude approach above, post-processed into the client's accounting schema, and pushed into their ERP or bookkeeping tool. The deterministic-first pattern (parse the structure, then let the model reason over clean data) is the same playbook we use for bank statement reconciliation, supplier catalog ingestion, and lead-list deduping. The point is never "use AI for everything" — it's "use AI for the part only AI can do, and use boring code for everything else."

Frequently asked questions

Why does Claude extract invoice tables inaccurately from PDFs?

When you attach a PDF, Claude processes it as a visual document and guesses at table structure from pixels. On invoices with merged header cells, multi-line descriptions, or subtotal rows spanning multiple columns, those guesses cascade and corrupt rows below. Tests on 1,000 supplier invoices showed direct PDF upload achieved only 67% row-level accuracy with 14-second average latency, making it unusable for accounting workflows.

How do I extract invoice tables accurately before sending them to Claude?

Use pdfplumber, a free Python library. Install it with pip install pdfplumber, then loop through pdf.pages and call page.extract_tables() to get aligned rows and cells based on the PDF's vector coordinates. Convert each table to CSV using the csv module, then paste the CSV string directly into your Claude prompt with an instruction to extract line items as JSON. The full pipeline is about 40 lines of code.

When should I use pdfplumber extraction vs direct PDF upload to Claude?

Use pdfplumber extraction for tabular data like invoices, bank statements, price lists, and inventory exports—anything fundamentally a grid. Use direct PDF upload for contracts, research reports, or mixed-layout documents with charts and prose, where vision processing is appropriate. The rule: if the document's value lives in the rows, extract first; if the value lives in paragraphs, upload directly.

What accuracy and cost gains does pre-extracting tables provide?

On a single 14-line invoice run through both pipelines, direct PDF attachment to Claude Haiku returned 12 line items (2 missing) with 4 wrong quantities in 14 seconds. The pdfplumber pipeline returned all 14 line items correctly in 3.2 seconds. Because the CSV is roughly one-third the token size of a rendered PDF, you also pay about 71% less in input tokens per invoice, turning a margin-bleeding automation into one that pays for itself quickly.

What should I do when pdfplumber returns no tables from a PDF?

When pdfplumber returns nothing, the PDF is typically a scan rather than digitally generated text. In that case, fall back to Claude's native PDF upload or run OCR first to convert the image into text. For digitally generated invoices—the vast majority of what suppliers send—pdfplumber reliably extracts clean tables using the PDF's actual vector coordinates.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.