PDF Text Extraction Lost Me 29% of Invoice Data — Send

A small accounting client was drowning in 200 supplier invoices a day. They wanted Claude to extract vendor, date, line items, tax, total — straight into their database. We did what every tutorial recommends: pdfplumber, extract text, send to Claude. On 800 production invoices, that pipeline came back at 71% field accuracy. Here's why text extraction silently destroys your data, and the image-based pipeline that hit 94% on the same batch — for less money.

Why text extraction quietly corrupts real invoices

When pdfplumber (or any text extractor) opens a PDF, it reads the embedded text stream. That stream has no concept of layout. It doesn't know what a stamp is. It doesn't know that a handwritten note in the margin overrides the printed total. Merged cells get flattened. Two-column layouts get interleaved. Tables with thin or missing borders get reassembled in whatever reading order the PDF generator happened to emit.
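To make the interleaving concrete, here is a toy sketch. It is not pdfplumber's actual algorithm: the `Word` type, the coordinates, and `naive_reading_order` are all made up for illustration. It only shows what happens when positioned words are joined line by line with no notion of columns.

```python
from typing import NamedTuple


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word, in points
    y: float  # distance from the top of the page, in points


def naive_reading_order(words: list[Word]) -> str:
    """Join words top-to-bottom, then left-to-right, ignoring columns."""
    ordered = sorted(words, key=lambda w: (w.y, w.x))
    return " ".join(w.text for w in ordered)


# Hypothetical page: a vendor block in the left column and a bill-to
# block in the right column, sharing the same two baselines.
page = [
    Word("Vendor:", 40, 100), Word("Acme", 95, 100),
    Word("Bill-to:", 320, 100), Word("Globex", 380, 100),
    Word("Total:", 40, 120), Word("120.00", 95, 120),
    Word("PO:", 320, 120), Word("7731", 380, 120),
]

flat = naive_reading_order(page)
print(flat)
```

The left column ("Vendor: Acme", "Total: 120.00") and the right column ("Bill-to: Globex", "PO: 7731") come out woven into a single line of text, and nothing in that string says which vendor the total belongs to.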

The model never sees the document. It sees a corrupted transcript of the document.

On clean, machine-generated invoices from a single ERP, this works fine. On real-world supplier paperwork — the kind any SMB actually receives — it falls apart. Here's what we logged on the 800-invoice batch:

  • Totals matched to the wrong vendor when two invoices were stapled and scanned as one PDF.
  • Line-item rows were skipped when alternating row shading was misread as a separator.
  • VAT/tax field came back empty on roughly one in three documents because the label sat in a merged header cell.
  • Worst of all: silent failures. The model confidently returned plausible-looking numbers that were just wrong. No exception. No warning. Straight into the database.

If you're processing volume, silent failures are the bug that eats your weekends.

The fix: stop sending text, send the page

Claude Vision sees the actual layout. It sees the stamp. It sees that the handwritten "paid in cash 14th" overrides the printed "NET 30." It sees which figures belong to which column. The fix is almost embarrassingly simple: render the PDF page as an image and send the image.

Here's the full pipeline in about 40 lines of Python.

import base64, json, io
from pdf2image import convert_from_path
from anthropic import Anthropic

client = Anthropic()

FIELDS_PROMPT = """Extract the following fields from this invoice.
Return ONLY valid JSON, no prose, no markdown fences.

{
  "vendor": str,
  "vendor_tax_id": str | null,
  "invoice_number": str,
  "issue_date": "YYYY-MM-DD",
  "due_date": "YYYY-MM-DD" | null,
  "line_items": [{"description": str, "qty": float, "unit_price": float, "amount": float}],
  "subtotal": float,
  "tax": float,
  "total": float,
  "currency": str
}
"""

def extract_invoice(pdf_path: str) -> dict:
    # Render every page of the PDF to a PIL image at 200 DPI.
    pages = convert_from_path(pdf_path, dpi=200)

    # Build one image block per page so a multi-page invoice goes out in a
    # single request: page 2's totals are read alongside page 1's line items.
    content = []
    for page in pages:
        buf = io.BytesIO()
        page.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        content.append({"type": "image", "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": b64
        }})
    content.append({"type": "text", "text": FIELDS_PROMPT})

    msg = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2000,
        messages=[{"role": "user", "content": content}]
    )
    # The prompt asks for bare JSON, so the first content block is parsed as-is.
    return json.loads(msg.content[0].text)

The four steps:

  • Render each PDF page to PNG at 200 DPI with pdf2image.
  • Encode the PNG as base64 and drop it into a Claude message as an image content block.
  • Prompt with explicit field names and a JSON schema. No prose. No explanations.
  • Validate the JSON downstream. That's covered in the validation section below.
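Even with a "no markdown fences" instruction, a reply can occasionally arrive wrapped in a fence. Here is a minimal sketch of a tolerant parser, assuming you would rather strip a fence than fail the whole invoice. `parse_model_json` is a hypothetical helper, not part of the Anthropic SDK or of the pipeline above.

```python
import json

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this listing clean


def parse_model_json(raw: str) -> dict:
    """Parse a model reply as a JSON object, tolerating one wrapping fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        text = text[len(FENCE):]
        # Drop the rest of the opening fence line (e.g. a "json" language tag).
        newline = text.find("\n")
        if newline != -1:
            text = text[newline + 1:]
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[:-len(FENCE)]
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on prose
    if not isinstance(data, dict):
        raise ValueError(f"expected a JSON object, got {type(data).__name__}")
    return data


print(parse_model_json('{"total": 12.5}'))
print(parse_model_json(FENCE + 'json\n{"total": 12.5}\n' + FENCE))
```

Anything that still fails to parse raises a `ValueError`, which is a loud failure you can route to the review queue instead of writing to the database.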

Why 200 DPI is the sweet spot

DPI matters more than people expect. Too low and small print, tax IDs, and stamp text turn into mush. Too high and you blow past the API's image size limit, plus you burn money on tokens for resolution the model can't use.

I tested four DPI settings on the same 100 invoices:

DPI | Avg file size | Field accuracy | Cost per page
--- | ------------- | -------------- | -------------
100 | 180 KB        | 78%            | $0.008
150 | 380 KB        | 89%            | $0.009
200 | 640 KB        | 94%            | $0.011
300 | 1.4 MB        | 94%            | $0.018

200 DPI reaches the accuracy ceiling without the 300 DPI surcharge: the same 94% for $0.011 per page instead of $0.018. Above 200, you're paying for pixels that add no accuracy.
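A rough sketch of why the higher setting costs more: pixel count grows with the square of the DPI. This assumes A4 pages (210 x 297 mm), which is my assumption and not something measured on the batch. `rendered_size` is a hypothetical helper, not a pdf2image function.

```python
MM_PER_INCH = 25.4


def rendered_size(width_mm: float, height_mm: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a page rasterized at the given DPI."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    return width_px, height_px


A4_MM = (210, 297)  # assumed page size

for dpi in (100, 150, 200, 300):
    w, h = rendered_size(*A4_MM, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h / 1e6:.1f} megapixels")

w200, h200 = rendered_size(*A4_MM, 200)
w300, h300 = rendered_size(*A4_MM, 300)
pixel_ratio = (w300 * h300) / (w200 * h200)
print(f"300 DPI carries {pixel_ratio:.2f}x the pixels of 200 DPI")
```

Going from 200 to 300 DPI is a 1.5x step in DPI but roughly 2.25x the pixels, which lines up with the file size jump in the table.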

The validation layer that catches the last 6%

The image pipeline gets you to 94% out of the box. The remaining gap isn't model error — it's a mix of genuinely unreadable documents (coffee stains, torn corners, fax artifacts from 2003) and small arithmetic drift. You catch both with one validation pass:

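def validate_invoice(inv: dict) -> tuple[bool, str]:
    # 1. Required fields present. Checked first so the arithmetic below
    #    never raises KeyError or TypeError on a missing or null value.
    for f in ("vendor", "invoice_number", "issue_date", "total"):
        if not inv.get(f):
            return False, f"missing {f}"
    for f in ("line_items", "subtotal", "tax"):
        if inv.get(f) is None:
            return False, f"missing {f}"

    # 2. Line items reconcile to subtotal. Differences are rounded to cents
    #    before comparing so float noise cannot trip the one-cent tolerance.
    line_sum = round(sum(li["amount"] for li in inv["line_items"]), 2)
    if abs(round(line_sum - inv["subtotal"], 2)) > 0.01:
        return False, f"line items {line_sum} != subtotal {inv['subtotal']}"

    # 3. Subtotal + tax = total (within one cent for rounding)
    if abs(round(inv["subtotal"] + inv["tax"] - inv["total"], 2)) > 0.01:
        return False, "subtotal+tax != total"

    return True, "ok"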

Anything that fails goes into a human review queue. On a recent batch of 50 invoices, three got flagged. Two were legitimate rounding differences in the source document. One was a real vendor arithmetic error — the kind of catch that pays for the whole system.
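The routing step can be sketched in a few lines. This is an illustration, not the client's actual queue: `triage` and the stand-in validator `totals_add_up` are hypothetical names, and any function returning an `(ok, reason)` tuple can be passed in.

```python
from typing import Callable

Validator = Callable[[dict], tuple[bool, str]]


def triage(invoices: list[dict], validate: Validator) -> tuple[list[dict], list[dict]]:
    """Split parsed invoices into accepted records and a human review queue."""
    accepted: list[dict] = []
    review_queue: list[dict] = []
    for inv in invoices:
        ok, reason = validate(inv)
        if ok:
            accepted.append(inv)
        else:
            # Keep the reason next to the record so the reviewer sees why it was flagged.
            review_queue.append({"invoice": inv, "reason": reason})
    return accepted, review_queue


def totals_add_up(inv: dict) -> tuple[bool, str]:
    """Stand-in validator for the demo: checks subtotal + tax against total."""
    diff = abs(round(inv["subtotal"] + inv["tax"] - inv["total"], 2))
    if diff > 0.01:
        return False, "subtotal+tax != total"
    return True, "ok"


batch = [
    {"invoice_number": "A-1", "subtotal": 100.0, "tax": 20.0, "total": 120.0},
    {"invoice_number": "A-2", "subtotal": 100.0, "tax": 20.0, "total": 125.0},
]
accepted, review_queue = triage(batch, totals_add_up)
print(len(accepted), "accepted,", len(review_queue), "flagged")
```

Storing the reason with each flagged record means the reviewer starts from the specific mismatch instead of re-reading the whole invoice.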

The pattern matters: text extraction fails quietly. Vision fails loudly, where you can catch it.

The cost math nobody runs honestly

People assume text extraction is "free" because pdfplumber is open source. That's true only if every invoice is a native PDF with a clean text layer. In any real batch, at least half are scans — phone photos forwarded by suppliers, multifunction printer outputs, faxes. Those need OCR.

Here's the actual per-page cost comparison on a mixed batch:

Pipeline                                  | OCR call | LLM call | Total per page
----------------------------------------- | -------- | -------- | --------------
pdfplumber + OCR fallback + Claude (text) | $0.015   | $0.003   | $0.018
pdf2image + Claude Vision                 | $0       | $0.011   | $0.011

The vision route is ~40% cheaper and more accurate. One library, one API, no OCR layer to maintain, no fallback logic for "is this PDF scanned or native?" Render, encode, send, validate.
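The arithmetic behind that figure, spelled out with the per-page prices from the comparison. The daily projection assumes one page per invoice at 200 invoices a day, which is a simplification of mine and not a measured number.

```python
OCR_PER_PAGE = 0.015       # OCR fallback call, text route
TEXT_LLM_PER_PAGE = 0.003  # Claude call on extracted text
VISION_PER_PAGE = 0.011    # Claude call on the rendered image

text_route = OCR_PER_PAGE + TEXT_LLM_PER_PAGE
savings = (text_route - VISION_PER_PAGE) / text_route

print(f"text route:   ${text_route:.3f} per page")
print(f"vision route: ${VISION_PER_PAGE:.3f} per page")
print(f"savings:      {savings:.1%}")

# Assumption: one page per invoice, 200 invoices per day.
PAGES_PER_DAY = 200
daily_text = PAGES_PER_DAY * text_route
daily_vision = PAGES_PER_DAY * VISION_PER_PAGE
print(f"daily: ${daily_text:.2f} (text) vs ${daily_vision:.2f} (vision)")
```

The saving works out to just under 39%, which is where the "~40% cheaper" figure comes from.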

That's the part that surprised the client most. They expected better accuracy to cost more.

What this looks like in production

I drop a folder of 50 mixed invoices onto the script. It renders, encodes, sends, parses, validates. About 90 seconds later I have 50 clean JSON records and a flagged review queue. Throughput scales linearly with concurrency — I run 8 parallel workers on the client's box and process their daily 200-invoice batch in under 6 minutes.

A few production lessons worth stealing:

  • Multi-page invoices: send all pages in one message, not separate calls. The model needs to see page 2's totals in context with page 1's line items.
  • Rate limits: batch with asyncio.Semaphore(8) rather than threading. Cleaner backpressure.
  • Logging: store the raw model response and the parsed JSON. When something looks off three weeks later, you need both.
  • Don't fine-tune yet: 94% with vanilla Sonnet beats most fine-tuned document AI platforms I've benchmarked, at a fraction of the integration cost.
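The rate-limit point can be sketched as follows. This is a minimal illustration, not the production script: `extract_one` is a placeholder that sleeps instead of calling the API, and `process_batch` is a hypothetical name. In a real run the worker would wrap the render-and-send step, for example through the SDK's async client.

```python
import asyncio
from typing import Awaitable, Callable

MAX_IN_FLIGHT = 8  # matches the 8 parallel workers mentioned above


async def extract_one(path: str) -> dict:
    """Placeholder for the real render-and-send step (an assumption)."""
    await asyncio.sleep(0.01)
    return {"source": path}


async def process_batch(
    paths: list[str],
    worker: Callable[[str], Awaitable[dict]] = extract_one,
    limit: int = MAX_IN_FLIGHT,
) -> list[dict]:
    """Run worker over every path with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def guarded(path: str) -> dict:
        async with sem:  # waits here once `limit` calls are already running
            return await worker(path)

    # gather returns results in the same order as the input paths
    return await asyncio.gather(*(guarded(p) for p in paths))


results = asyncio.run(process_batch([f"invoice_{i}.pdf" for i in range(50)]))
print(len(results), "invoices processed")
```

The semaphore is where the backpressure comes from: all 50 tasks are created up front, but only eight ever hold a slot at once.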

If you're processing any volume of messy real-world PDFs — invoices, receipts, contracts, delivery notes, lab reports — try this before you reach for a heavier stack. No fine-tuning. No vector database. No document AI platform with per-page enterprise pricing.

Why bizflowai.io helps with this

This exact pipeline — image-based PDF extraction with schema-validated JSON output and a human-in-the-loop review queue — is one of the workflows we deploy for accounting firms, distributors, and ops teams running thousands of supplier documents a month. We handle the rendering, the prompt engineering for your specific field schema, the validation rules tuned to your tolerance, and the integration into your accounting or ERP stack. If you're already drowning in invoices and tired of silent OCR failures, bizflowai.io is where we run this end-to-end.

Frequently asked questions

Why does text extraction fail on real-world PDF invoices?

Text extractors like pdfplumber read a PDF's embedded text stream, which ignores stamps, signatures, and handwritten margin notes. Merged cells get flattened, two-column layouts get interleaved, and tables with thin or missing borders get reassembled in whatever order the PDF generator emitted. The model receives a corrupted transcript instead of the document. On 800 Serbian SMB invoices, this approach hit only 71% field accuracy, and the failures were silent: confidently wrong numbers, and empty VAT fields on roughly one in three documents.

How do I extract structured data from messy PDF invoices using Claude?

Render each PDF page to PNG at 200 DPI using pdf2image, base64-encode the image, and send it to Claude Vision as an image content block alongside a structured prompt listing every field you need (vendor, tax ID, dates, line items, subtotal, VAT, total) requesting JSON only. Parse the response and validate that subtotal plus VAT equals total within one cent, flagging mismatches for human review.

When should I use Claude Vision instead of text extraction for PDFs?

Use Claude Vision for real-world business paperwork containing stamps, signatures, handwritten notes, merged cells, multi-column layouts, or scanned pages. Text extraction works fine for clean, machine-generated PDFs but falls apart on messy documents. For invoices, receipts, contracts, delivery notes, or lab reports, sending images skips OCR entirely and preserves layout context the text stream loses.

Why is the image pipeline cheaper than text extraction for PDFs?

Text extraction looks free but requires an OCR API call per page on scanned PDFs, which make up at least half of most real invoice batches. Combined with the Claude call, that totals roughly 1.8 cents per page. The pure image route uses only pdf2image and Claude Vision with no OCR layer, costing about 1.1 cents per page — better accuracy with fewer moving parts.

How accurate is Claude Vision for invoice data extraction?

On a batch of 800 Serbian SMB invoices, the image-based Claude Vision pipeline achieved 94% field accuracy compared to 71% for text extraction. The remaining 6% was a mix of genuinely unreadable documents (coffee stains, torn corners, fax artifacts) that a human would also flag, and small arithmetic drift. Critically, vision failures are loud and catchable via subtotal-plus-VAT validation, while text extraction failures are silent and confident.


Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
