Don't Upload PDFs to Claude — Do This Instead (8x Cheaper)

By Lazar Milicevic · Published June 17, 2026 · 8 min read

Same invoice. Same model. Same extraction prompt. One path costs 8,200 tokens, the other costs 1,400. If you're processing invoices, receipts, or contracts at any kind of volume, the difference is hiding in your monthly bill and most tutorials don't even mention it.

I build invoice automation pipelines for accounting clients. Here's exactly what's happening and how to fix it.

What Claude Actually Does When You Upload a PDF

When you drag a PDF into the Claude UI or send it as a document block via the API, Claude doesn't just read the text layer. For every PDF it does three things:

Re-renders every page as a raster image
Runs vision on those images
Extracts the text layer

You pay for the output of all of it: image tokens for every rendered page, plus text tokens for the extracted text. For a scanned receipt, the vision pass earns its keep, because there's no text layer to extract. For a clean digital invoice exported from QuickBooks, Fakturko, or any modern ERP, the vision pass is pure waste. You already have perfect selectable text. You're paying for image tokens just to look at it again.

Run the math on a single supplier invoice I tested last week:

Path	Input tokens	Cost per doc (Sonnet)
Direct PDF upload	8,200	~$0.025
Preprocessed plain text	1,400	~$0.004

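That's a 5.9x reduction in input tokens on the same document (8,200 / 1,400). For one accounting client at 800 invoices/month, the monthly bill went from $168 to $20. Those bill figures are higher than 800 times the per-document input cost above, so read them as the total spend the client saw rather than input tokens alone. No model change, no prompt change, no caching tricks.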
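The per-document figures in the table can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming an input price of $3.00 per million tokens (an assumption; check current pricing for your model) and counting input tokens only, so output tokens and retries are left out:

```python
# Assumption: $3.00 per million input tokens. Verify against current pricing.
PRICE_PER_MILLION_INPUT_USD = 3.00


def input_cost_usd(tokens: int, price_per_million: float = PRICE_PER_MILLION_INPUT_USD) -> float:
    """Cost of the input tokens for one request, in US dollars."""
    return tokens / 1_000_000 * price_per_million


direct_tokens = 8_200        # direct PDF upload, from the table above
preprocessed_tokens = 1_400  # preprocessed plain text, from the table above

direct_cost = input_cost_usd(direct_tokens)
preprocessed_cost = input_cost_usd(preprocessed_tokens)
token_ratio = direct_tokens / preprocessed_tokens

print(f"direct:       ${direct_cost:.4f} per doc")
print(f"preprocessed: ${preprocessed_cost:.4f} per doc")
print(f"reduction:    {token_ratio:.2f}x fewer input tokens")
```

Swap in your own token counts and price to see what the change is worth at your volume.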

The 25-Line Preprocessor

No vector database. No chunking strategy. No LangChain. Just pdfplumber, a loop, and structured section headers so the model knows where it is in the document.

import pdfplumber
from pathlib import Path

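def pdf_to_structured_text(pdf_path: str) -> str:
    """Flatten a digital PDF into labelled plain text, page by page."""
    out = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            out.append(f"=== PAGE {i} ===")

            # Full page text; `or ""` guards against a missing text layer.
            # Note: this text also contains the table cells, so the tables
            # below repeat them in row form.
            text = page.extract_text() or ""
            if text.strip():
                out.append("INVOICE_HEADER")
                out.append(text)

            # Each table is a list of rows; each row is a list of cells,
            # where an empty cell may be None.
            tables = page.extract_tables()
            for t_idx, table in enumerate(tables, start=1):
                out.append(f"LINE_ITEMS_TABLE_{t_idx}")
                for row in table:
                    cleaned = [(c or "").strip() for c in row]
                    out.append(" | ".join(cleaned))

    return "\n".join(out)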

# Critical: write with explicit UTF-8 if you persist the intermediate
Path("invoice.txt").write_text(
    pdf_to_structured_text("invoice.pdf"),
    encoding="utf-8"
)

Then your Claude call is boring on purpose:

import anthropic
client = anthropic.Anthropic()

structured = pdf_to_structured_text("invoice.pdf")

msg = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Extract supplier, VAT ID, totals, and line items as JSON.\n\n{structured}"
    }],
)

That's the whole fix. The section headers (INVOICE_HEADER and LINE_ITEMS_TABLE_1 here, plus TOTALS in the production version further down) aren't decoration: they replace the spatial layout cues Claude was reconstructing from the image pass. Give the model structure in text and it stops guessing.
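To see what the model actually receives for a table, here is a small sketch of the row flattening on its own, with no PDF involved. The `flatten_table` helper and the sample rows are hypothetical; the input shape (a list of rows, each a list of cells that may be None) matches what the preprocessor loops over.

```python
from typing import Optional

Row = list[Optional[str]]


def flatten_table(table: list[Row], label: str) -> str:
    """Turn a list-of-rows table into labelled, pipe-separated text lines."""
    lines = [label]
    for row in table:
        # Empty cells come through as None; normalise them to "".
        cleaned = [(cell or "").strip() for cell in row]
        # Skip rows that are entirely empty.
        if any(cleaned):
            lines.append(" | ".join(cleaned))
    return "\n".join(lines)


# Hypothetical sample data, shaped like one extracted table.
sample_table = [
    ["Description", "Qty", "Unit price", "Amount"],
    ["Hosting, annual ", "1", "120.00", "120.00"],
    [None, None, None, None],
    ["Support hours", "3", "40.00", "120.00"],
]

flattened = flatten_table(sample_table, "LINE_ITEMS_TABLE_1")
print(flattened)
```

Every row keeps its column order, so "Qty" and "Unit price" can't trade places the way they can when layout is inferred from pixels.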

What this skips on purpose

No OCR step (digital PDFs already have a text layer)
No vector embeddings (single-document extraction, not retrieval)
No fancy framework (one dependency, pdfplumber)

The UTF-8 Trap That Will Eat Your VAT Numbers

This is the part no tutorial covers because most tutorials test on clean English invoices.

If you're processing documents in Serbian, Croatian, Czech, German, Polish, Hungarian, or anything else with characters outside basic ASCII, the text is fine while it lives in memory: pdfplumber returns ordinary Python strings. The damage happens at the boundaries. Write an intermediate file or read it back without an explicit encoding and Python (outside UTF-8 mode) falls back to your system's default. On Windows that's often cp1252. The characters ć, č, š, đ, ž then either raise an encoding error or get mangled into garbage bytes the moment a layer re-encodes them with a different code page.

What happens next is the dangerous part: Claude doesn't error. Claude silently "corrects" the mangled characters into plausible Latin substitutions. A VAT ID RS123456789 on a supplier line for Đorđević d.o.o. came back with the company name rewritten as Dordevic d.o.o. and, the part that made me lose half a day, random digit substitutions in adjacent numeric fields. My best guess is that the broken byte sequences threw off tokenization.
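You can reproduce the failure without a PDF. The sketch below shows both ways an encoding mismatch goes wrong for the supplier name above: a loud error when the target code page lacks the character, and silent garbage when UTF-8 bytes are decoded with the wrong single-byte encoding. Using latin-1 for the silent case is an assumption standing in for whatever legacy code page your system uses; it is chosen here because it decodes every byte without raising.

```python
name = "Đorđević d.o.o."

# 1) Loud failure: Đ, đ and ć do not exist in cp1252, so encoding raises.
try:
    name.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# 2) Silent failure: bytes written as UTF-8 but read back with a different
#    single-byte encoding decode without error into the wrong characters.
utf8_bytes = name.encode("utf-8")
mangled = utf8_bytes.decode("latin-1")

# 3) Safe path: the same explicit encoding on both sides.
round_tripped = utf8_bytes.decode("utf-8")

print("cp1252 encode succeeded:", cp1252_ok)
print("mangled:", repr(mangled))
print("round trip intact:", round_tripped == name)
```

The second case is the one that reaches Claude unnoticed, because nothing in the pipeline raises.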

The fix is one keyword argument:

# When writing intermediate files
Path("invoice.txt").write_text(structured, encoding="utf-8")

# When reading back
text = Path("invoice.txt").read_text(encoding="utf-8")

# When opening manually
with open("invoice.txt", "w", encoding="utf-8") as f:
    f.write(structured)

If you're shipping this on Windows, also set:

import sys
sys.stdout.reconfigure(encoding="utf-8")  # Python 3.7+

Skip this and you'll spend a week thinking your prompt is the problem. It isn't. Your bytes are broken before Claude ever sees them.
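A cheap guard before the API call can catch some of this damage early. The sketch below is a heuristic, not a complete detector: the hypothetical `find_encoding_damage` helper flags the Unicode replacement character and C1 control characters, which show up when UTF-8 bytes are decoded as latin-1 and almost never appear in real invoice text. Damage that decodes into printable characters will slip past it.

```python
import re

# U+0080..U+009F are C1 control characters; U+FFFD is the replacement
# character that decoders emit for bytes they cannot interpret.
_SUSPECT = re.compile("[\u0080-\u009f\ufffd]")


def find_encoding_damage(text: str) -> list[str]:
    """Return the lines that contain characters typical of encoding damage."""
    # split("\n") rather than splitlines(): splitlines() treats U+0085
    # as a line break and would swallow one of the characters we look for.
    return [line for line in text.split("\n") if _SUSPECT.search(line)]


clean = "Đorđević d.o.o.\nRS123456789"
damaged = clean.encode("utf-8").decode("latin-1")  # simulate a bad read

print(find_encoding_damage(clean))
print(len(find_encoding_damage(damaged)))
```

If the function returns anything, stop and fix the encoding instead of sending the text to the model.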

The Real Cost Spreadsheet From One Client

This is from one accounting client running the pipeline in production. 800 supplier invoices per month, mixed Serbian VAT format and EU formats (DE, AT, HU), mixed currencies (RSD, EUR).

Metric	Direct PDF upload	Preprocessed text
Avg input tokens per doc	8,200	1,400
Cost per doc (Sonnet 4.5)	$0.025	$0.004
Monthly volume	800	800
Monthly Anthropic bill	$168	$20
Extraction accuracy	91%	96%
Avg latency per doc	6.8s	2.1s

Two things to notice.

First, accuracy went up, not down. Structured plain text with explicit section markers removes the model's guesswork about whether something is a header, a line item, or a totals row. When you let Claude reconstruct layout from a rendered page image, it occasionally misreads which column is "quantity" vs "unit price" on dense tables. Pre-flatten the tables and that ambiguity disappears.

Second, latency dropped by roughly 3x (6.8s to 2.1s per document). Vision passes are slow. Skipping them isn't just cheaper: it makes batch jobs finish in about a third of the time, which matters when you're processing yesterday's invoices before the 9 AM reconciliation run.
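For a sense of scale, here is the batch-time arithmetic for the monthly volume, assuming the invoices are processed one after another with no parallelism (an assumption; a concurrent pipeline would shorten both numbers):

```python
def batch_minutes(docs: int, seconds_per_doc: float) -> float:
    """Wall-clock minutes for a strictly sequential batch."""
    return docs * seconds_per_doc / 60


direct_minutes = batch_minutes(800, 6.8)        # direct PDF upload
preprocessed_minutes = batch_minutes(800, 2.1)  # preprocessed text

print(f"direct:       {direct_minutes:.0f} min")
print(f"preprocessed: {preprocessed_minutes:.0f} min")
```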

When you should still upload the PDF directly

Scanned documents (no text layer to extract)
Handwritten notes or annotations on receipts
Documents where checkbox state matters (visual inspection)
Stamped/signed contracts where signature presence is part of the extraction

For everything else — anything exported digitally from accounting software, an ERP, or a webshop — preprocess.
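The first rule in that list, no text layer, is easy to automate. The sketch below routes a document based on the text extracted from each page. The `choose_path` helper and the 50-character threshold are hypothetical and should be tuned on your own documents. It does not detect handwriting, checkboxes, or signatures; those still need a rule per document type.

```python
def choose_path(page_texts: list[str], min_chars_per_page: int = 50) -> str:
    """Return "text" if every page has a usable text layer, else "pdf".

    page_texts holds the extracted text of each page, "" when none.
    """
    if not page_texts:
        return "pdf"
    for text in page_texts:
        if len(text.strip()) < min_chars_per_page:
            # At least one page looks scanned: send the original PDF.
            return "pdf"
    return "text"


digital = ["Invoice 2026-041\nSupplier: Example d.o.o.\nTotal: 240.00 EUR" * 2]
scanned = ["", "   "]

print(choose_path(digital))
print(choose_path(scanned))
```

With pdfplumber, `page_texts` can be built as `[(p.extract_text() or "") for p in pdf.pages]`.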

A Production-Grade Version You Can Drop In

The 25-line version teaches the concept. Here's the version actually running for the client, with a marker for pages that have no text layer and a totals heuristic:

import pdfplumber
from pathlib import Path

TOTAL_KEYWORDS = ("total", "ukupno", "iznos", "summe", "gesamt")

def pdf_to_structured_text(pdf_path: str) -> str:
    out = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            out.append(f"=== PAGE {i} ===")

            text = (page.extract_text() or "").strip()
            if not text:
                out.append("[NO TEXT LAYER — consider OCR fallback]")
                continue

            # Header: first 8 lines usually carry supplier + invoice meta
            lines = text.splitlines()
            out.append("INVOICE_HEADER")
            out.extend(lines[:8])

            # Tables
            tables = page.extract_tables()
            for t_idx, table in enumerate(tables, start=1):
                out.append(f"LINE_ITEMS_TABLE_{t_idx}")
                for row in table:
                    cleaned = [(c or "").strip() for c in row]
                    if any(cleaned):
                        out.append(" | ".join(cleaned))

            # Totals heuristic — lines containing total keywords
            totals = [
                ln for ln in lines
                if any(kw in ln.lower() for kw in TOTAL_KEYWORDS)
            ]
            if totals:
                out.append("TOTALS")
                out.extend(totals)

    return "\n".join(out)

if __name__ == "__main__":
    result = pdf_to_structured_text("invoice.pdf")
    Path("invoice.txt").write_text(result, encoding="utf-8")
    print(f"Extracted {len(result)} chars")

This handles roughly 94% of the supplier invoices the client receives without modification. The remaining 6% (mostly scanned PDFs from one specific supplier who still uses a fax-to-PDF pipeline) fall back to Claude's vision path, which is fine — they're the exception, not the rule.
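For the documents that do fall back, the request carries the PDF itself as a base64 document block. The sketch below only builds the `messages` payload, so it runs without an API key. The `build_pdf_fallback_messages` helper is hypothetical, and the block shape reflects my understanding of Anthropic's PDF support, so verify it against the current API documentation before relying on it.

```python
import base64


def build_pdf_fallback_messages(pdf_bytes: bytes, prompt: str) -> list[dict]:
    """Build a messages list that sends the raw PDF plus a text prompt."""
    encoded = base64.standard_b64encode(pdf_bytes).decode("ascii")
    return [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": encoded,
                },
            },
            {"type": "text", "text": prompt},
        ],
    }]


# Placeholder bytes for illustration; in practice read the file with
# Path("invoice.pdf").read_bytes().
messages = build_pdf_fallback_messages(
    b"%PDF-1.4 placeholder",
    "Extract supplier, VAT ID, totals, and line items as JSON.",
)
print(messages[0]["content"][0]["type"])
```

The result is passed as `messages=` to the same `client.messages.create` call used for the text path.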

Why bizflowai.io helps with this

Document extraction at volume is one of the categories where bizflowai.io already builds and runs pipelines for clients: invoice ingestion, supplier matching, and line-item extraction into accounting systems, with preprocessing layers like the one above already baked in so the LLM bill stays low as document count grows. If you're processing more than a few hundred PDFs a month and the monthly bill is creeping into three figures, the fix is almost never a different model. It's the 25 lines before the API call.

Frequently asked questions

Why is uploading PDFs directly to Claude so expensive?

When you send a PDF directly to Claude, it re-renders every page as an image, runs vision processing on it, and also extracts the text layer. You pay for both passes. A typical invoice consumes around 8,200 input tokens (about 2.5 cents at Sonnet pricing). For clean digital invoices with selectable text, the vision pass is pure waste.

How do I reduce Claude API costs when processing PDF invoices?

Preprocess PDFs with pdfplumber in about 25 lines of Python. Open the PDF, loop through pages, extract text and tables separately, and format them as a structured plain-text block with clear section headers like INVOICE_HEADER, LINE_ITEMS_TABLE_1, and TOTALS. Send that string to Claude via a normal API call. This drops token usage from roughly 8,200 to 1,400 per invoice.

Why does UTF-8 encoding matter when preprocessing PDFs for Claude?

If you process documents in Serbian, German, Czech, or any other language with non-ASCII characters (Latin or Cyrillic script), you must explicitly set the encoding to UTF-8 when writing and reading the intermediate text. Otherwise the file can be written in the system's default encoding, the characters arrive mangled, and Claude silently hallucinates corrections, such as rewriting Đorđević as Dordevic and altering digits in nearby numeric fields. The fix is adding encoding='utf-8' wherever you open the file; skipping it causes silent data corruption.

Does preprocessing PDFs reduce extraction accuracy?

No, accuracy improves. In a production case with 800 supplier invoices per month across mixed Serbian and EU formats, extraction accuracy rose from 91 percent to 96 percent after preprocessing. Structured plain text with clear section headers removes Claude's guesswork about document layout, so the model spends its reasoning on field extraction rather than visual interpretation.

How much can preprocessing save versus direct PDF upload to Claude?

For one accounting client processing 800 invoices monthly, direct PDF upload cost 168 dollars while preprocessed text cost 20 dollars, a savings of 148 dollars per month. Direct upload runs about 2.5 cents per document at 8,200 tokens; preprocessed runs about 0.4 cents at 1,400 tokens. The preprocessing script took one afternoon to write.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.