I Indexed 1,847 PDFs Once. Claude Answers in 1.4 Seconds.

By Lazar Milicevic · Published June 22, 2026 · 7 min read

Every time you drag a PDF into Claude, you're paying for the same extraction twice. If you've got a folder of invoices, contracts, or supplier docs and you're re-uploading them every chat just to ask one cross-document question, you're burning money on work the model already did last week. Here's the local pipeline I built for a client that indexes a folder once and lets Claude answer across all of it in a single call.

The pattern that costs solopreneurs 2-4 hours a week

A small invoicing business needs to answer one question: show me every invoice from one telecom supplier over $50,000 in Q3. So they open Claude, drag in six PDFs, wait for each to process, ask the question, get a partial answer because Claude only sees what's in the current chat, then drag in four more. Context window fills up, they start a new chat, do it again tomorrow.

That's 2-4 hours a week, every week, paying for the same extraction over and over. And the moment they want to ask a question across the whole archive — hundreds of past invoices — the upload approach completely breaks. You can't drag 800 PDFs into a chat. So the question stops getting asked, and the data stops being useful.

Every tutorial out there is built around one PDF and one question. Extraction accuracy gets all the attention. Nobody talks about retrieval across a real archive, which is where solopreneurs actually live. Generic SaaS document tools want $80/month per seat and still don't let you ask natural language questions across the full history without re-ingesting.

The fix is a local retrieval-augmented generation (RAG) index. Build it once. Query it forever. According to Stanford's 2025 AI Index report, the cost of querying a model at GPT-3.5 level fell more than 280-fold between November 2022 and October 2024. But the bigger cost in PDF workflows is no longer per-token. It's repeated extraction of the same document, and an index removes that: each file is extracted once.

Step 1: Walk the folder and extract text per page

Point a Python script at the folder where your PDFs already live. For this client that was 1,847 supplier invoices sitting in a single directory — no renaming, no manual sorting, no migration project.

Use pdfplumber and extract per page, not per document. Per-page chunks map back to a real location in a real document, which means when Claude answers it can cite the exact filename and page number. That's the audit trail that makes this safe for invoicing and contract work.

import pdfplumber
from pathlib import Path

def extract_pages(pdf_path: Path):
    pages = []
    with pdfplumber.open(pdf_path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            if text.strip():
                pages.append({
                    "file": pdf_path.name,
                    "page": i,
                    "text": text
                })
    return pages

all_chunks = []
for pdf in Path("./invoices").rglob("*.pdf"):
    all_chunks.extend(extract_pages(pdf))

One honest limitation: if your PDFs are scanned images, pdfplumber pulls nothing because there's no text layer. You need an OCR pass first (tesseract or paddleocr) before embedding. Build it as a pre-step that flags scanned files and runs OCR only on those, so you don't waste compute on PDFs that already have clean text. Plenty of supplier PDFs in the wild are scans, and most tutorials never warn you.
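
The flagging half of that pre-step could look roughly like the sketch below. The function names and the `min_chars` threshold are illustrative assumptions, not part of the client pipeline, and the OCR call itself is left out because it depends on which engine you pick.

```python
from pathlib import Path


def looks_scanned(page_texts, min_chars=20):
    """Heuristic: treat a PDF as scanned if no page yields real text.

    `min_chars` is an assumed threshold that ignores stray characters.
    """
    return all(len((t or "").strip()) < min_chars for t in page_texts)


def find_scanned(folder):
    """Return paths of PDFs under `folder` that appear to lack a text layer."""
    import pdfplumber  # imported lazily so the heuristic above has no dependencies

    scanned = []
    for pdf_path in Path(folder).rglob("*.pdf"):
        with pdfplumber.open(pdf_path) as pdf:
            texts = [page.extract_text() for page in pdf.pages]
        if looks_scanned(texts):
            scanned.append(pdf_path)
    return scanned
```

Writing the result of `find_scanned("./invoices")` to a manifest file gives you the list to OCR. Note that this all-or-nothing check will not catch a PDF that mixes text pages with scanned pages; a per-page version of the same test would.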

Step 2: Embed locally with sentence-transformers

Embed each page with sentence-transformers running locally. The model is small (all-MiniLM-L6-v2 is ~80 MB), fits on a normal laptop, no API calls, no per-token cost. For 1,847 invoices the full embedding pass took 11 minutes on a home server with WSL Ubuntu.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [c["text"] for c in all_chunks]
embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)

Why local? Two reasons:

Cost. OpenAI's text-embedding-3-small runs $0.02 per 1M tokens. Cheap, but it adds up when you re-index nightly across thousands of pages and you still need an internet round trip per batch.
Data residency. These are client invoices with tax IDs and bank routing numbers. They never leave the machine. For US small businesses handling PII, that simplifies your obligations around state-level data laws (CCPA in California, plus the patchwork of others) considerably.
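
One caveat on the embedding step: all-MiniLM-L6-v2 truncates input longer than 256 word pieces, so the bottom of a dense page may never reach the embedding. A minimal sketch of one workaround is to split long pages into overlapping word windows before encoding. The `max_words` and `overlap` values here are assumptions (word count is only a rough proxy for word pieces), and `split_page` is a hypothetical helper, not something the pipeline above already does.

```python
def split_page(text, max_words=150, overlap=30):
    """Split one page's text into overlapping windows of at most `max_words` words.

    Whitespace is normalized to single spaces, which is fine for embedding.
    Short pages come back as a single part; empty pages as an empty list.
    """
    words = text.split()
    if len(words) <= max_words:
        return [" ".join(words)] if words else []

    step = max_words - overlap
    parts = []
    for start in range(0, len(words), step):
        parts.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this window already reached the end of the page
    return parts
```

If you adopt this, keep `file` and `page` in each part's metadata and add the part index to the ID, so citations still point at a real page.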

Step 3: Store in Chroma — one Python install, no cloud

Chroma is a single pip install chromadb. The database is just a folder on disk. No cloud account, no monthly bill, no API key.

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("invoices")

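# Chroma caps how many records one add() call accepts, and ~8,400 pages can
# exceed it, so insert in batches. BATCH = 1000 is a conservative choice.
# IDs must be unique: same-named files in different subfolders would collide.
BATCH = 1000
for start in range(0, len(all_chunks), BATCH):
    batch = all_chunks[start:start + BATCH]
    collection.add(
        ids=[f"{c['file']}_p{c['page']}" for c in batch],
        embeddings=embeddings[start:start + BATCH].tolist(),
        metadatas=[{"file": c["file"], "page": c["page"]} for c in batch],
        documents=[c["text"] for c in batch],
    )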

Disk footprint for 1,847 invoices (roughly 8,400 pages indexed): ~340 MB. That includes the raw text, the 384-dim embeddings, and Chroma's index. Backs up with a single tar -czf.

Why not just stuff everything in Claude's context?

Approach	Cost per query	Latency	Scales past 50 PDFs?
Drag PDFs into chat	~$0.04 + 3-5 min human time	30-90s upload + answer	No — context cap
Upload to Claude Files API per query	$0.02-0.10	5-15s	Yes, but you re-pay extraction
Local RAG (this post)	~$0.002	1.4s	Yes — tested to 50k+ pages

The RAG numbers are measured on this client's setup. Your mileage will vary with question complexity and top-k.

Step 4: Query — embed the question, pull top 10, then call Claude

This is the part most tutorials skip. When a question comes in, embed it with the same sentence-transformers model, pull the top 10 most relevant page chunks from Chroma, and only then call the Claude API with those chunks plus the question.

Claude never sees the whole archive. It sees 10 relevant pages, every time, whether the archive has 50 PDFs or 50,000.

import anthropic

def ask(question: str, k: int = 10):
    q_emb = model.encode([question])[0].tolist()
    results = collection.query(query_embeddings=[q_emb], n_results=k)

    context = "\n\n".join([
        f"[{m['file']} p.{m['page']}]\n{doc}"
        for doc, m in zip(results["documents"][0], results["metadatas"][0])
    ])

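    # Separate name so it does not shadow the Chroma `client` defined earlier.
    claude = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = claude.messages.create(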
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                "Answer using only the invoice excerpts below. "
                "Cite filename and page for every figure.\n\n"
                f"{context}\n\nQuestion: {question}"
            )
        }],
    )
    return msg.content[0].text

Return the answer with source filenames and page numbers so the user can click straight to the original PDF if they need to verify. For invoicing or contract work, that audit trail is non-negotiable.
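
One way to return those sources in a structured form, sketched under the assumption that you keep the Chroma query result around: `cited_sources` is a hypothetical helper that lists the retrieved pages in rank order. It reports what was sent to Claude, which is not necessarily the subset Claude ended up citing.

```python
def cited_sources(results):
    """Collect unique (file, page) pairs from a Chroma query result, in rank order."""
    seen = set()
    sources = []
    for meta in results["metadatas"][0]:
        key = (meta["file"], meta["page"])
        if key not in seen:
            seen.add(key)
            sources.append({"file": meta["file"], "page": meta["page"]})
    return sources
```

Having the query function return both the answer text and `cited_sources(results)` lets a Slack bot or web UI render each source as a link to the original PDF.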

Step 5: Schedule indexing and measure the actual unit cost

Schedule the indexing script to run once a night against the incoming folder, so new PDFs join the index automatically and nobody ever drags a file into a chat again.

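# /etc/cron.d/pdf_index
# Files in /etc/cron.d need a user field between the schedule and the command
# (user "lazar" assumed from the path).
0 2 * * * lazar cd /home/lazar/pdf-rag && /usr/bin/python3 index_new.py >> log 2>&1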

index_new.py just diffs the folder against the IDs already in Chroma and only embeds new files. Nightly run for this client takes 4-30 seconds depending on how many new invoices arrived.
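
The script itself isn't shown here, so the following is only a sketch of what that diff might look like. It assumes each chunk's metadata carries a `file` key holding the PDF's file name, as in Step 3; `indexed_files` and `files_to_index` are hypothetical names.

```python
from pathlib import Path


def indexed_files(collection):
    """File names already present in the index, read from chunk metadata."""
    stored = collection.get(include=["metadatas"])
    return {m["file"] for m in stored["metadatas"]}


def files_to_index(folder, already_indexed):
    """PDFs under `folder` whose name is not in the index yet, sorted for stable runs."""
    return sorted(
        p for p in Path(folder).rglob("*.pdf") if p.name not in already_indexed
    )
```

Each path returned by `files_to_index("./invoices", indexed_files(collection))` then goes through the same extract, embed, and add steps as the initial build. One thing to watch: a file that yields no text (a scan, for instance) never gets a chunk, so it will be retried every night unless you track it separately.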

Measured unit cost on the real workload:

Query latency: 1.4 seconds average (0.2s embed + 0.1s Chroma + ~1.1s Claude).
Cost per query: ~$0.002 (about 1.5k input tokens of context + a short answer on Sonnet 4.5).
Old way: ~$0.04 per chat + 3-5 minutes of human drag-and-drop.

The client runs roughly 20 queries a day. Old way: ~$240/year in API + ~400 hours/year of human time. New way: ~$15/year in API + ~30 seconds per query. The local index paid for itself in the first week of use.
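
To measure this on your own workload rather than take the numbers on faith, the arithmetic can be sketched as below. The per-million-token prices are parameters you fill in from the current pricing page (none are hard-coded here), and both function names are illustrative.

```python
def query_cost(input_tokens, output_tokens, usd_per_mtok_in, usd_per_mtok_out):
    """Cost in USD of one API call, given token counts and per-million-token prices."""
    return (
        input_tokens * usd_per_mtok_in + output_tokens * usd_per_mtok_out
    ) / 1_000_000


def yearly_cost(cost_per_query, queries_per_day, days_per_year=365):
    """Annual API spend; `days_per_year` is an assumption you can lower to working days."""
    return cost_per_query * queries_per_day * days_per_year
```

The token counts come back on every response as `msg.usage.input_tokens` and `msg.usage.output_tokens`, so you can log `query_cost(...)` per call. As a check against the figures above, `yearly_cost(0.002, 20)` comes to roughly 14.6, in line with the ~$15/year estimate.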

When this design breaks (and what to do)

Tables with multi-column financials. Call page.extract_tables() instead of page.extract_text() on those pages, then serialize each table to Markdown before embedding.
Questions requiring math across many docs (e.g. "total spend Q3"). Top-10 retrieval can miss invoices. Add a metadata filter step that pulls every matching invoice by date range, then asks Claude to sum. That requires storing the invoice date as metadata at index time, since the index above keeps only file and page.
Scanned PDFs. OCR pre-pass as noted above. Flag scanned files in a separate manifest so you can audit OCR errors.
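
The first two fixes can be sketched as follows. `table_to_markdown` takes one table in the shape pdfplumber's `extract_tables()` returns (a list of rows whose cells are strings or None). `date_range_filter` builds a Chroma `where` clause and assumes you stored a `supplier` string and an integer `invoice_date` in YYYYMMDD form as metadata; both field names are hypothetical, since the index built above stores only `file` and `page`.

```python
def table_to_markdown(rows):
    """Serialize one table (list of rows, cells may be None) to a Markdown table."""
    if not rows:
        return ""

    def clean(cell):
        text = "" if cell is None else str(cell)
        # Newlines and pipes inside a cell would break the Markdown row.
        return text.replace("\n", " ").replace("|", "\\|").strip()

    header, *body = [[clean(c) for c in row] for row in rows]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def date_range_filter(supplier, start, end):
    """Chroma `where` filter for one supplier within an inclusive date range.

    Assumes `supplier` and integer YYYYMMDD `invoice_date` metadata exist.
    """
    return {
        "$and": [
            {"supplier": {"$eq": supplier}},
            {"invoice_date": {"$gte": start}},
            {"invoice_date": {"$lte": end}},
        ]
    }
```

Passing the filter to `collection.get(where=date_range_filter(...))` returns every matching chunk rather than the top 10, which is what a "total spend" question needs before anything is handed to Claude.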

Why bizflowai.io helps with this

This exact pipeline — folder watcher, per-page chunking, local embeddings, Chroma store, RAG query layer with citations — is one of the standard automations I deploy through bizflowai.io for small businesses sitting on years of invoices, contracts, and supplier PDFs. The setup is one-time, runs on your own machine or a $20/month VPS, and the queries land in whatever interface the team already uses (Slack, Telegram, a small web UI). No per-seat SaaS fee, no documents leaving your infrastructure.

Frequently asked questions

What is the problem with using Claude chat to query a large PDF archive?

Claude's chat interface only sees PDFs uploaded into the current context window. For solopreneurs with hundreds or thousands of invoices, you can't drag 800 PDFs into a chat, the context fills up fast, and you pay to re-extract the same documents repeatedly. This wastes 2-4 hours weekly and makes archive-wide questions impossible to answer.

How do I build a local PDF search system for invoices using Claude?

Point a Python script at your PDF folder, extract text per page with pdfplumber, embed each page locally using sentence-transformers, and store vectors in a local Chroma database. On queries, embed the question, pull the top 10 relevant page chunks from Chroma, then send only those chunks plus the question to the Claude API. Schedule nightly indexing for new files.

Why does per-page PDF extraction matter for retrieval?

Extracting text per page creates natural chunks that map back to a specific location in a specific document. This means when Claude answers, it can cite the exact filename and page number, giving users an audit trail to click straight to the original PDF for verification. That traceability is essential for invoicing and contract work.

How much does a local Chroma + Claude query cost compared to chat uploads?

A local retrieval query costs about two tenths of a cent because Claude only processes 10 small page chunks instead of full PDFs. The traditional chat upload approach costs roughly 4 cents per chat plus 3-5 minutes of manual drag-and-drop. For daily queries, the local index pays for itself within the first week.

When should I use OCR before embedding PDFs?

Use OCR when your PDFs are scanned images rather than digital documents with a text layer. Tools like pdfplumber extract nothing from image-only PDFs because there's no embedded text. Run an OCR pass with tesseract or paddleocr as a pre-step that flags scanned files and processes only those, so you don't waste compute on PDFs that already have text.