Your Claude PDF Bill Is 94% Too High — Here's The Fix

By Lazar Milicevic · Published June 15, 2026 · 9 min read

You drop a 40-page contract into Claude, ask one question, and pay for 39 pages the model doesn't read. Do that 1,200 times a month and the bill stops being trivial. I'll walk through the exact preprocessing swap that took a real client's Anthropic bill from $192 to $11 — same model, better answers.

The cost pattern nobody catches until the invoice arrives

The scenario is familiar. A client sends a multi-page contract or a bank statement. You drop the PDF into Claude, ask "what's the late payment clause?" or "what was the total billed in March?" — and you get a clean answer. So you keep doing it. Ten queries become fifty. The Anthropic console starts showing numbers that don't match the value you're getting.

I saw this on a document automation system we run for a small invoicing business. Real numbers from their billing logs, not estimates:

Volume: ~1,200 queries/month across client PDFs
Average document: 40 pages, ~52,000 input tokens
Cost per query: ~$0.16 (Claude Sonnet input pricing)
Monthly bill: $192

The answers were fine. The pattern was the problem. Every single question shipped the entire PDF as input context. Asking "where's the payment date?" on a 40-page contract uploads 39 pages of boilerplate the model glances at and ignores. You're not paying for intelligence — you're paying to re-read the same legal preamble 1,200 times a month.

The model needs one paragraph on page 3. The other 39 pages ride along on your dime.
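
The arithmetic behind that bill fits in a few lines. This is a rough sketch, not a billing tool: the $3 per million input tokens is an assumption chosen because it is consistent with the ~$0.16 per query above, and output tokens are ignored. Check the rate against current pricing before relying on it.

```python
# Assumed rate, consistent with the ~$0.16/query figure in the text.
INPUT_PRICE_PER_MTOK = 3.00  # USD per million input tokens (assumption)


def input_cost(tokens_per_query: int, queries: int = 1) -> float:
    """Input-token cost in USD. Output tokens are not included."""
    return tokens_per_query * queries * INPUT_PRICE_PER_MTOK / 1_000_000


per_query = input_cost(52_000)
monthly = input_cost(52_000, queries=1_200)
print(f"per query: ${per_query:.3f}, monthly: ${monthly:.2f}")
```

Under these assumptions the sketch gives about $0.156 per query and about $187 a month. Rounding the per-query figure up to $0.16 gives the $192 quoted above.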

Where the waste hides

Repeated ingestion cost — same document, re-tokenized on every question
Linear per-token billing on long context: every input token is billed at the same rate, including the ones the model deprioritizes
No reuse across users or sessions — even prompt caching only helps if you query the same doc repeatedly within the cache window

The three-step fix: chunk once, embed once, retrieve top-k

The fix is unglamorous and it's been around since the first RAG tutorials. Most people skip it because they assume you need Pinecone, Weaviate, or some managed vector DB to do it "properly." You don't. Here's the actual pipeline I run.

Step 1 — Chunk the PDF by page on ingestion. Once, when the document first arrives. Not every time a question comes in.

import pypdf
from pathlib import Path

def chunk_pdf_by_page(pdf_path: str, doc_id: str):
    reader = pypdf.PdfReader(pdf_path)
    pages = []
    for i, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        if not text.strip():
            continue
        pages.append({
            "doc_id": doc_id,
            "page_number": i + 1,
            "source_file": Path(pdf_path).name,
            "text": text,
        })
    return pages

For messy layouts or table-heavy pages, swap pypdf for pdfplumber: slower, but better at tables. Neither library reads scanned, image-only pages; those need OCR first. Run this once per document. Store the output.
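
A minimal sketch of the pdfplumber variant, assuming you want the same page dictionaries as chunk_pdf_by_page. The names table_to_text and chunk_pdf_with_pdfplumber are hypothetical, and appending tables as pipe-delimited text is one possible choice, not the only one.

```python
from pathlib import Path


def table_to_text(rows):
    """Flatten one extracted table (rows of cells, cells may be None) into pipe-delimited lines."""
    return "\n".join(
        " | ".join((cell or "").strip() for cell in row) for row in rows
    )


def chunk_pdf_with_pdfplumber(pdf_path: str, doc_id: str):
    # Imported here so table_to_text stays usable without pdfplumber installed.
    import pdfplumber

    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages):
            text = page.extract_text() or ""
            # Tables are appended as extra text; cell values may repeat what
            # extract_text() already returned.
            tables = [table_to_text(t) for t in page.extract_tables()]
            combined = "\n\n".join(part for part in [text, *tables] if part.strip())
            if not combined:
                continue  # nothing extractable on this page
            pages.append({
                "doc_id": doc_id,
                "page_number": i + 1,
                "source_file": Path(pdf_path).name,
                "text": combined,
            })
    return pages
```

The output has the same shape as the pypdf version, so the embedding step does not need to change.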

Step 2 — Embed each page with a cheap model. Voyage's voyage-3-lite or OpenAI's text-embedding-3-small both work. We're talking fractions of a cent per document — a 40-page PDF embeds for roughly $0.0004. Store the vectors in SQLite using the sqlite-vec extension.

import sqlite3
import sqlite_vec
from openai import OpenAI

client = OpenAI()
db = sqlite3.connect("docs.db")
db.enable_load_extension(True)
sqlite_vec.load(db)

db.execute("""
    CREATE TABLE IF NOT EXISTS pages (
        id INTEGER PRIMARY KEY,
        doc_id TEXT, page_number INTEGER,
        source_file TEXT, text TEXT
    )
""")
db.execute("""
    CREATE VIRTUAL TABLE IF NOT EXISTS page_vecs
    USING vec0(embedding float[1536])
""")

def embed_and_store(pages):
    for p in pages:
        emb = client.embeddings.create(
            model="text-embedding-3-small",
            input=p["text"]
        ).data[0].embedding
        cur = db.execute(
            "INSERT INTO pages (doc_id, page_number, source_file, text) VALUES (?, ?, ?, ?)",
            (p["doc_id"], p["page_number"], p["source_file"], p["text"]),
        )
        db.execute(
            "INSERT INTO page_vecs (rowid, embedding) VALUES (?, ?)",
            (cur.lastrowid, sqlite_vec.serialize_float32(emb)),
        )
    db.commit()
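
The function above makes one embedding request per page. A possible variation, sketched under assumptions: the embeddings endpoint accepts a list of inputs, so pages can be sent in batches. The name embed_texts and the batch size of 64 are hypothetical; the API has its own per-request limits, so tune the batch size to your page lengths.

```python
def embed_texts(client, texts, model="text-embedding-3-small", batch_size=64):
    """Embed texts in batches; returns one vector per input, in input order."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        resp = client.embeddings.create(model=model, input=batch)
        # Each result carries the index of its input within the batch;
        # sort on it so vectors line up with the texts.
        for item in sorted(resp.data, key=lambda d: d.index):
            vectors.append(item.embedding)
    return vectors
```

With the OpenAI client from above, embed_texts(client, [p["text"] for p in pages]) returns the vectors in page order, ready for the same two INSERT statements.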

No managed service, no monthly vector DB bill, no vector DB credentials to rotate. The entire storage layer is a single .db file. Backups are cp docs.db docs.db.bak. SQLite with sqlite-vec comfortably handles tens of thousands of pages on a laptop.
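
Copying the file works when nothing is writing to it. If the ingestion job might be mid-write, a safer option is SQLite's online backup API, which Python exposes as Connection.backup. A minimal sketch; the function name backup_db is hypothetical.

```python
import sqlite3


def backup_db(src_path: str, dest_path: str) -> None:
    """Copy a SQLite database using the online backup API."""
    src = sqlite3.connect(src_path)
    dest = sqlite3.connect(dest_path)
    try:
        # Copies the database page by page into dest.
        src.backup(dest)
    finally:
        dest.close()
        src.close()
```

Calling backup_db("docs.db", "docs.db.bak") produces the same single-file backup as cp.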

Step 3 — Retrieve top-3 pages on every query and send only those to Claude.

import anthropic

claude = anthropic.Anthropic()

def answer(question: str, doc_id: str, k: int = 3):
    q_emb = client.embeddings.create(
        model="text-embedding-3-small",
        input=question
    ).data[0].embedding

    rows = db.execute("""
        SELECT pages.page_number, pages.text
        FROM page_vecs
        JOIN pages ON pages.id = page_vecs.rowid
        WHERE pages.doc_id = ?
        ORDER BY vec_distance_cosine(page_vecs.embedding, ?)
        LIMIT ?
    """, (doc_id, sqlite_vec.serialize_float32(q_emb), k)).fetchall()

    context = "\n\n".join(
        f"--- Page {pg} ---\n{txt}" for pg, txt in rows
    )

    msg = claude.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=512,
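        messages=[{
            "role": "user",
            "content": (
                # Report the number of pages actually retrieved; it follows k
                # and can be lower for short documents.
                f"Here are {len(rows)} pages from document {doc_id}.\n\n"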
                f"{context}\n\n"
                f"Question: {question}\n\n"
                f"Answer and cite the page number."
            ),
        }],
    )
    return msg.content[0].text

That's about thirty lines of retrieval code. No agents, no orchestration framework, no graph. Just embed, search, prompt.
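
To make the search step less of a black box, here is what the ORDER BY and LIMIT amount to, written in plain Python. This is an illustration only, not how sqlite-vec is implemented; cosine_distance and top_k_pages are hypothetical names, and the sketch assumes non-zero vectors.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means more similar. Assumes non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k_pages(query_vec, page_vecs, k=3):
    """page_vecs maps page_number -> vector. Returns the k nearest page numbers."""
    ranked = sorted(page_vecs, key=lambda pg: cosine_distance(query_vec, page_vecs[pg]))
    return ranked[:k]


pages = {1: [1.0, 0.0], 2: [0.0, 1.0], 3: [0.9, 0.1], 4: [-1.0, 0.0]}
print(top_k_pages([1.0, 0.0], pages, k=2))
```

At a few dozen pages per document, a brute-force scan like this is cheap, which is part of why a managed index is unnecessary at this scale.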

The numbers, side by side

This is the same client, same documents, same questions. Pulled from billing logs after two weeks on each pipeline.

Metric	Direct PDF upload	Indexed retrieval
Input tokens per query	~52,000	~3,200
Cost per query	~$0.16	<$0.01
Monthly cost (1,200 queries)	$192	~$11
One-time embedding cost per doc	$0	~$0.0004
Storage	Re-upload every time	Single SQLite file
Page citation in answer	Sometimes wrong	Reliable

94% cost reduction. And the embedding cost is a rounding error — you'd need to embed 28,000 forty-page documents to spend $11 on ingestion.
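
Both figures can be checked from the table. A short sketch using only the numbers quoted above:

```python
direct_monthly = 192.00      # USD, direct PDF upload
indexed_monthly = 11.00      # USD, indexed retrieval
embed_cost_per_doc = 0.0004  # USD, one-time, per 40-page document

reduction = 1 - indexed_monthly / direct_monthly
docs_to_spend_11 = indexed_monthly / embed_cost_per_doc

print(f"reduction: {reduction:.1%}")
print(f"documents to spend $11 on embeddings: {docs_to_spend_11:,.0f}")
```

The reduction comes out at about 94.3%, and the break-even at about 27,500 documents, which rounds to the 28,000 quoted above.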

The payback period for the engineering work is week one. By the end of month one, you've saved an order of magnitude more than the code took to write.

What this does not require

A vector database service (no Pinecone, no Weaviate, no Qdrant Cloud)
A reranker (helpful at scale, unnecessary at this volume)
A framework (LangChain/LlamaIndex add abstractions that hide the simple math)
GPU infrastructure (CPU embedding via API is fine for this scale)

Accuracy goes up, not down — and here's why

The part most cost-optimization posts skip: this swap improves answer quality. It sounds counterintuitive — surely more context is better? — but in production it's the opposite.

When you stuff 40 pages into a single prompt:

Claude has to figure out which section is relevant before it can answer
Repeated headers, footers, and boilerplate create "lookalike" passages
A clause from section 4.2 can get confused with section 14.2
Numbers from one table get attributed to another
Page citations become unreliable because the model doesn't know which page it pulled from

When you pre-filter to the three most relevant pages:

The model sees a small, focused context window
There's no competing wrong answer in the prompt
Page numbers in the citation match the pages you actually sent
Hallucination rate on numerical fields drops noticeably
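
Because you know exactly which pages went into the prompt, the citation can be checked mechanically. A minimal sketch, assuming the answer cites pages in a "page N" form; cited_pages and citation_is_grounded are hypothetical helpers, and the pattern would need widening for other citation styles.

```python
import re


def cited_pages(answer: str):
    """Page numbers mentioned as 'page N' or 'pages N' (first number only for ranges)."""
    return {int(n) for n in re.findall(r"\bpages?\s+(\d+)", answer, flags=re.IGNORECASE)}


def citation_is_grounded(answer: str, sent_pages) -> bool:
    """True when the answer cites at least one page and every cited page was sent."""
    cited = cited_pages(answer)
    return bool(cited) and cited <= set(sent_pages)
```

An answer that fails this check is a good candidate for human review or a retry with a larger k.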

On the invoicing client's data, we tracked one specific failure mode: pulling the wrong total from a multi-page statement where each page had its own subtotal. Direct upload hit that error roughly 1 in 12 queries. Indexed retrieval hit it on 1 in ~80 queries — and when it did, it cited the wrong page, which made the error trivially catchable in review.

You cut the bill and you cut the error rate with the same change.

Where this breaks, and how to patch it

Honesty about the limitations:

Questions that require synthesis across many pages. "Summarize the whole contract" or "list every payment obligation across the document" — top-3 retrieval will miss things. Fix: detect synthesis-style questions with a cheap classifier and route those to a larger context window (or do hierarchical summarization on ingestion and query the summaries).
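
A sketch of that routing decision. A keyword heuristic stands in for the cheap classifier here; the hint list and the names needs_full_context and choose_route are assumptions for illustration, and a small model call could replace the heuristic later without changing the caller.

```python
# Hypothetical hint phrases; extend these from your own query logs.
SYNTHESIS_HINTS = (
    "summarize", "summarise", "summary", "overview",
    "list every", "list all", "across the document", "whole contract",
)


def needs_full_context(question: str) -> bool:
    """Rough guess at whether a question needs more than a few pages."""
    q = question.lower()
    return any(hint in q for hint in SYNTHESIS_HINTS)


def choose_route(question: str) -> str:
    return "full_document" if needs_full_context(question) else "top_k"
```

Misrouting a lookup question to the full document only costs money, while misrouting a synthesis question to top-k costs accuracy, so it is reasonable to err toward the full-document route.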

Scanned PDFs. pypdf returns empty strings on image-only pages. Pipe through Tesseract or a vision model on ingestion. Pay the OCR cost once, never again.
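
Before paying for OCR, it helps to know which pages actually need it. A small sketch that flags pages whose extracted text is empty or nearly so; pages_needing_ocr is a hypothetical helper and the 20-character threshold is an arbitrary assumption.

```python
def pages_needing_ocr(page_texts, min_chars=20):
    """page_texts holds one extracted string (or None) per page, in order.

    Returns 1-based page numbers with too little text to be useful.
    """
    return [
        i + 1
        for i, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]
```

With pypdf, page_texts can be built as [p.extract_text() for p in reader.pages], and only the flagged pages go to the OCR step.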

Tables and forms. Page-level chunking can split a table across pages, or bury a key number in a wall of cells. For invoice automation specifically, I run a second extraction pass that pulls structured fields (amount, due date, vendor) into their own indexed records, separate from the page text.
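
One way such a pass could look, sketched with regular expressions. The patterns, the ISO date format, and the extract_fields name are all assumptions for illustration; real invoices vary enough that a production version would need per-vendor patterns or a model-based extractor.

```python
import re

# \b keeps "Subtotal" from matching as "total".
AMOUNT_RE = re.compile(
    r"\b(?:total|amount due)\s*:?\s*\$?\s*([\d,]+\.\d{2})", re.IGNORECASE
)
DUE_DATE_RE = re.compile(r"\bdue date\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def extract_fields(page_text: str) -> dict:
    """Pull a total amount and an ISO-format due date out of page text, when present."""
    fields = {}
    amount = AMOUNT_RE.search(page_text)
    if amount:
        fields["amount"] = float(amount.group(1).replace(",", ""))
    due = DUE_DATE_RE.search(page_text)
    if due:
        fields["due_date"] = due.group(1)
    return fields
```

Storing these fields as their own records lets a question like "what was the total?" be answered from a lookup instead of a similarity search.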

Multi-document questions. "Compare these three contracts" — drop the doc_id filter and retrieve across the corpus, but expect to bump k to 6-9 pages.
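
A sketch of making the doc_id filter optional, assuming the same pages and page_vecs tables as in Step 2. build_retrieval_query is a hypothetical helper; it also returns doc_id with each row so the prompt can say which contract a page came from.

```python
from typing import List, Optional, Tuple


def build_retrieval_query(
    query_blob: bytes, k: int, doc_id: Optional[str] = None
) -> Tuple[str, List[object]]:
    """Build the top-k SQL and its parameters; omit doc_id to search every document."""
    sql = (
        "SELECT pages.doc_id, pages.page_number, pages.text "
        "FROM page_vecs JOIN pages ON pages.id = page_vecs.rowid "
    )
    params: List[object] = []
    if doc_id is not None:
        sql += "WHERE pages.doc_id = ? "
        params.append(doc_id)
    sql += "ORDER BY vec_distance_cosine(page_vecs.embedding, ?) LIMIT ?"
    params += [query_blob, k]
    return sql, params
```

The result plugs into db.execute(sql, params), with query_blob being the serialized question embedding.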

Cache invalidation. When a document is updated, delete its old rows from both pages and page_vecs, then re-embed. Two DELETEs and one re-ingest. It stays trivial because the page rows are keyed by doc_id and the vector rows share their rowids.
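
A minimal sketch of that cleanup against the Step 2 schema. Since page_vecs is keyed by rowid rather than doc_id, the page ids are looked up first and the vectors are removed one rowid at a time. The name delete_document is hypothetical.

```python
import sqlite3


def delete_document(db: sqlite3.Connection, doc_id: str) -> int:
    """Remove every page and vector for doc_id. Returns the number of pages removed."""
    ids = [row[0] for row in db.execute(
        "SELECT id FROM pages WHERE doc_id = ?", (doc_id,)
    )]
    # Vectors share their rowid with pages.id, so delete them by rowid.
    db.executemany("DELETE FROM page_vecs WHERE rowid = ?", [(i,) for i in ids])
    db.execute("DELETE FROM pages WHERE doc_id = ?", (doc_id,))
    db.commit()
    return len(ids)
```

After this, re-running the chunk and embed steps for the updated file restores the document under the same doc_id.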

None of these break the pipeline. They're routing decisions on top of it.

Why bizflowai.io helps with this

Document-cost blowups are one of the most common patterns I see when I audit client AI bills — invoicing, contract review, supplier statements, KYC docs. The pipeline above is roughly what runs in production for those clients: page-level chunking on ingestion, local SQLite vector store, top-k retrieval to Claude or GPT, with structured field extraction layered on top for invoice and statement automation. bizflowai.io builds these document pipelines end-to-end for solopreneurs and small teams who are watching their Anthropic or OpenAI bill creep up and want the same answers for a tenth of the cost.

Frequently asked questions

Why does sending whole PDFs to Claude cost so much?

Each query re-sends the entire document as input tokens. A 40-page PDF is roughly 52,000 input tokens per question, costing about 16 cents per query at Claude's pricing. At 1,200 queries a month, that's around $192, even though the model only needs one paragraph to answer most questions. You're paying for 39 pages of context the model never actually uses.

How do I reduce Claude API costs on PDF queries?

Use a three-step retrieval pipeline. First, chunk the PDF by page on ingestion using pypdf or pdfplumber, storing each page with metadata. Second, embed each page with a cheap model like Voyage or OpenAI's small embedding model, saving vectors in SQLite with sqlite-vec. Third, embed incoming questions, run similarity search, and send only the top three pages to Claude. This cuts costs by around 94%.

When should I use SQLite instead of a managed vector database?

For solopreneurs and small agencies processing client documents, SQLite with the sqlite-vec extension handles tens of thousands of pages on a laptop without performance issues. You avoid Pinecone or Weaviate subscription costs, and the entire storage layer is a single file you can back up by copying. Managed vector databases only make sense at much larger scale.

Does retrieval-based PDF querying improve accuracy?

Yes. When Claude receives 40 pages of context, it has to identify what's relevant, which increases the chance of drifting, hallucinating clauses from the wrong section, or quoting numbers from the wrong table. Pre-filtering to the three most relevant pages helps the model focus, cite the correct page, and avoid mixing up similar-looking sections. The same change that cuts costs also reduces errors.

How much can a small business save by indexing PDFs before querying Claude?

On a real production system processing 1,200 monthly queries across client PDFs, monthly costs dropped from $192 to about $11, a 94% reduction. Direct upload used 52,000 input tokens per query at 16 cents. Indexed retrieval used roughly 3,200 input tokens for three relevant pages plus the question, costing under a cent. Savings compound as document volume grows.

Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.