47-Page Bank Statement in Claude: 12 Transactions Vanished

I dragged a real 47-page bank statement into Claude last week and asked for the transaction count. It came back with 172. The actual number was 184. No warning, no truncation flag, no "I may have skipped rows" disclaimer — just a clean, confident, wrong answer that a bookkeeper would paste straight into a client reconciliation.
If you're running long financial PDFs through Claude (or GPT, or Gemini — same failure mode), you are almost certainly shipping wrong numbers right now and not knowing it. Here's exactly what's happening and the 15 lines of Python that fixed it.
The failure: a silent 6.5% miss on a routine document
The test document was a real bank export, 47 pages, 184 transactions, the kind of statement an accounting client sends through every month. The prompt was deliberately boring:
"Give me the total debits and the transaction count."
Claude responded with a total that looked plausible and a count of 172. The true count was 184. Twelve transactions were missing somewhere in the middle of the document — not the last page, not a corrupted region, just rows the model decided weren't important enough to enumerate.
The dangerous part is not the wrong number. The dangerous part is the confidence:
- No error.
- No "the document is long, results may be incomplete."
- No `[truncated]` marker.
- No request to chunk the file.
If you're a solopreneur who isn't going to hand-verify a 47-page PDF row-by-row (and nobody does — that's why you're using AI in the first place), this number goes straight into a spreadsheet. Then into a tax filing. Then into a client report.
Why it happens: the visual token ceiling nobody documents
When you drag a PDF into Claude's UI or send it through the API as a document block, the model doesn't read it as text. Each page is rendered as a combination of image tokens plus extracted text. That dual representation is what makes native PDF upload feel magical on short documents — Claude can "see" the logo on a one-page invoice and read the stamp.
But that representation has a hard ceiling. Anthropic's documented PDF limit is around 100 pages and 32 MB, and well before you hit either bound, the model starts compressing. On dense, repetitive layouts — bank statements, ledger exports, supplier invoice tables — the compression is brutal:
- Tiny fonts get downsampled.
- Repeating row structures get pattern-matched ("rows 40-55 look like rows 20-39, skip").
- The model decides it has the gist and moves on.
You don't get an error because, from the model's perspective, nothing went wrong. It produced an answer. It's just an answer based on a summarized internal view of the document, not on every row.
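One cheap guard before any native upload is to check the two limits quoted above and refuse loudly when a file exceeds them. The sketch below is only that: the function name is hypothetical, the 32 MB figure is assumed to mean binary megabytes, and the page count and file size are expected to come from whatever PDF library you already use.

```python
MAX_PAGES = 100                 # page limit quoted above
MAX_BYTES = 32 * 1024 * 1024    # 32 MB, assuming binary megabytes


def check_native_limits(page_count: int, size_bytes: int) -> None:
    """Raise instead of uploading a PDF that exceeds the quoted limits."""
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if size_bytes > MAX_BYTES:
        raise ValueError(f"{size_bytes} bytes exceeds the {MAX_BYTES}-byte limit")


# The 47-page statement passes this check (the file size here is a made-up value).
check_native_limits(page_count=47, size_bytes=2_000_000)
```

Passing this check says nothing about row coverage. The 47-page statement in this article is well inside both limits and still lost 12 rows, so the guard only catches the loud half of the problem.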
Where this fails hardest
- Bank and credit card statements (dense tables, hundreds of rows)
- Multi-page contracts with numbered clauses
- Supplier invoices with long line-item lists
- Insurance policies and lease agreements
- Any PDF where the value is in structured data, not visual layout
The fix: parse first, reason second
The fix is not a smarter prompt. It is not "Claude Opus instead of Sonnet." It is not chain-of-thought. It is removing the visual token layer entirely and handing the model structured rows.
Fifteen lines of Python with pdfplumber:
```python
import json

import pdfplumber

rows = []
with pdfplumber.open("statement.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables() returns one list of rows per table found on the page
        for table in page.extract_tables() or []:
            for row in table[1:]:  # skip header (assumes each table starts with one)
                rows.append({
                    "date": row[0],
                    "description": row[1],
                    "debit": row[2],
                    "credit": row[3],
                    "balance": row[4],
                })

with open("statement.json", "w") as f:
    json.dump(rows, f)
```
That's the whole thing. The output for a 47-page statement is roughly 40 KB of JSON. You then hand that JSON to Claude with the same prompt:
```python
from anthropic import Anthropic

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("statement.json") as f:
    payload = f.read()

msg = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"Here are bank statement rows as JSON:\n{payload}\n\n"
                   "Return total debits and total transaction count. "
                   "Count every row. Do not skip or summarize."
    }],
)
print(msg.content[0].text)
```
Result on the same document: 184 of 184 transactions. Every row counted. The total reconciles to the cent.
The reason this works is dull, and that's the point: there are no images to compress, no layout to guess at, no font-size penalty. Every row reaches the model as plain text in a flat list of objects, so there is nothing for it to skim past.
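Because the rows are already structured, the count and the debit total can also be computed in plain Python and compared with the model's answer. This is a minimal sketch, not part of the original pipeline: `parse_amount` and `summarize` are hypothetical helpers, the sample rows are invented, and the amount cells are assumed to be strings like `1,234.56`, empty strings, or `None`.

```python
from decimal import Decimal


def parse_amount(cell):
    """Turn a cell like '1,234.56' into a Decimal; empty or None becomes 0."""
    text = (cell or "").replace(",", "").replace("$", "").strip()
    return Decimal(text) if text else Decimal("0")


def summarize(rows):
    """Return (transaction_count, total_debits) for a list of row dicts."""
    total_debits = sum((parse_amount(r["debit"]) for r in rows), Decimal("0"))
    return len(rows), total_debits


# Invented sample rows in the same shape as statement.json
sample = [
    {"date": "2024-03-01", "description": "Coffee", "debit": "4.50", "credit": "", "balance": "95.50"},
    {"date": "2024-03-02", "description": "Salary", "debit": None, "credit": "1,000.00", "balance": "1,095.50"},
    {"date": "2024-03-03", "description": "Rent", "debit": "900.00", "credit": "", "balance": "195.50"},
]

count, total = summarize(sample)
print(count, total)  # 3 904.50
```

`Decimal` is used instead of `float` so that totals compare exactly to the cent. If the model's count or total disagrees with these values, that disagreement is worth surfacing as an error rather than picking one silently.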
The numbers: 71% cheaper, 100% accurate
Same document, same prompt, two paths. Here is what it actually cost on Sonnet pricing:
| Approach | Transactions found | Tokens (in + out) | Cost per run |
|---|---|---|---|
| Native PDF upload | 172 / 184 | ~95k visual + text | $0.31 |
| pdfplumber → JSON → Claude | 184 / 184 | ~12k text only | $0.09 |
That's a 71% cost reduction and a jump from 93.5% accuracy to 100% accuracy. The accuracy delta is what matters — cost is just a bonus — but the cost compounds fast.
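For anyone who wants to check the percentages, they follow directly from the four numbers in the table. The variable names below are only for illustration.

```python
found, actual = 172, 184             # native upload: transactions found vs. true count
native_cost, json_cost = 0.31, 0.09  # cost per run in dollars, from the table

missed_pct = (actual - found) / actual * 100
coverage_pct = found / actual * 100
cost_reduction_pct = (native_cost - json_cost) / native_cost * 100

print(f"missed: {missed_pct:.1f}%")                  # missed: 6.5%
print(f"coverage: {coverage_pct:.1f}%")              # coverage: 93.5%
print(f"cost reduction: {cost_reduction_pct:.0f}%")  # cost reduction: 71%
```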
A single accounting client in my pipeline pushes roughly 2,000 pages of statements and invoices per month. At the native rate that's around $13/month per client in raw PDF token costs. At the JSON-first rate it's about $3.80. Across 20 clients you've turned a $260 monthly inference bill into $76, and more importantly you've stopped shipping wrong reconciliations.
The cost math at small scale
- 1 client (2,000 pages/mo): saves ~$110/year
- 10 clients: saves ~$1,100/year
- 20 clients: saves ~$2,200/year and removes the silent-accuracy risk entirely
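The scaling above can be reproduced with a few lines of arithmetic. This sketch assumes cost grows linearly with page count, using the 47-page run as the unit, which is a simplification. It also works from unrounded monthly figures, so the yearly numbers land slightly above the rounded ones in the list.

```python
PAGES_PER_DOC = 47                   # the test statement
NATIVE_COST, JSON_COST = 0.31, 0.09  # dollars per 47-page run
PAGES_PER_MONTH = 2_000              # one client


def monthly_cost(cost_per_doc, pages=PAGES_PER_MONTH):
    """Monthly cost, assuming cost scales linearly with pages (an assumption)."""
    return pages / PAGES_PER_DOC * cost_per_doc


native = monthly_cost(NATIVE_COST)    # about 13.19
json_first = monthly_cost(JSON_COST)  # about 3.83

for clients in (1, 10, 20):
    yearly_saving = (native - json_first) * 12 * clients
    print(f"{clients} client(s): ~${yearly_saving:,.0f}/year")
```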
The decision tree I use in production
I build document ingestion pipelines for accounting and bookkeeping clients every week. After enough silent-accuracy bugs, I converged on a simple rule that lives at the top of every pipeline:
```python
def route_document(pdf_path):
    # get_page_count and detect_tables are helpers that are not shown here
    page_count = get_page_count(pdf_path)
    has_tables = detect_tables(pdf_path)

    if page_count <= 3 and not has_tables:
        return "native_upload"        # logo invoice, signed letter, screenshot
    if page_count <= 3 and has_tables:
        return "extract_then_reason"
    if page_count <= 10 and not has_tables:
        return "native_upload"        # short contract, short policy
    return "extract_then_reason"      # everything else
```
The plain-English version:
- 3 pages or fewer, no tables → native upload. The model's visual grounding is genuinely useful.
- 3 pages or fewer, with tables → extract first. Cheap insurance.
- 4-10 pages, prose only → native upload is still fine.
- Anything over 10 pages, or anything with dense tables → always extract to JSON first. No exceptions.
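The routing function relies on `get_page_count` and `detect_tables`, which are not shown. One possible way to write them with pdfplumber is sketched below; these bodies are assumptions, not the production helpers. `route` is a hypothetical pure-function restatement of the same rule, which reduces to two outcomes and is easy to unit-test without a PDF on disk.

```python
def get_page_count(pdf_path):
    """Number of pages in the PDF."""
    import pdfplumber  # imported lazily so route() can be tested without it

    with pdfplumber.open(pdf_path) as pdf:
        return len(pdf.pages)


def detect_tables(pdf_path):
    """True if pdfplumber's table finder reports a table on any page."""
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return any(page.find_tables() for page in pdf.pages)


def route(page_count, has_tables):
    """Same rule as the decision tree: tables or more than 10 pages means extract."""
    if has_tables or page_count > 10:
        return "extract_then_reason"
    return "native_upload"


print(route(2, False))   # native_upload
print(route(2, True))    # extract_then_reason
print(route(47, False))  # extract_then_reason
```

One caveat: pdfplumber's default table detection looks for ruling lines, so statements with borderless tables may need custom table settings before `detect_tables` reports them.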
This single decision tree eliminated almost every silent accuracy bug I used to see. Not "reduced" — eliminated. The remaining failures are now loud failures (parser can't read a scanned image, OCR confidence flags a row) which is exactly what you want. Loud failures get fixed. Silent failures get shipped to clients.
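One way to turn a dropped row into a loud failure is a running-balance check over the extracted rows. The sketch below is illustrative only and rests on three assumptions: each row carries the balance after that transaction, rows are in statement order, and debits reduce the balance while credits increase it. The helper names and sample rows are hypothetical.

```python
from decimal import Decimal


def to_decimal(cell):
    """Parse '1,234.56' style cells; empty or None becomes 0."""
    text = (cell or "").replace(",", "").strip()
    return Decimal(text) if text else Decimal("0")


def check_running_balance(rows):
    """Raise ValueError at the first row whose balance does not follow from the previous row."""
    for prev, cur in zip(rows, rows[1:]):
        expected = (
            to_decimal(prev["balance"])
            - to_decimal(cur["debit"])
            + to_decimal(cur["credit"])
        )
        if expected != to_decimal(cur["balance"]):
            raise ValueError(
                f"balance break at {cur['date']} {cur['description']!r}: "
                f"expected {expected}, statement shows {cur['balance']}"
            )


# Invented sample rows
rows = [
    {"date": "03-01", "description": "Opening deposit", "debit": "", "credit": "100.00", "balance": "100.00"},
    {"date": "03-02", "description": "Coffee", "debit": "4.50", "credit": "", "balance": "95.50"},
    {"date": "03-03", "description": "Rent", "debit": "50.00", "credit": "", "balance": "45.50"},
]

check_running_balance(rows)  # passes: every balance follows from the previous one

try:
    check_running_balance([rows[0], rows[2]])  # simulate a dropped middle row
except ValueError as err:
    print("caught:", err)
```

A missing or duplicated row breaks the chain at the point where it happened, which also tells you where in the statement to look.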
When native upload is still the right call
I'm not anti-native-PDF. For the right shape of document it's the better tool:
- A one-page invoice with a logo, stamp, and handwritten signature.
- A screenshot of an email or a chat thread.
- A short signed letter where the formatting carries meaning.
- A contract excerpt where you need to quote clauses verbatim and the layout matters.
The failure mode is specifically dense, repetitive, tabular data at length. That's where the visual token compression eats your rows. For anything else, native upload is fine and often better.
Why bizflowai.io helps with this
This is the exact category of work I automate for clients through bizflowai.io — document ingestion pipelines that route PDFs through the right extraction path before any LLM sees them, with row-level reconciliation checks that fail loudly instead of silently. For bookkeepers and small accounting teams that means bank statements, supplier invoices, and expense reports flow into clean structured data with 100% row coverage, the LLM only handles the reasoning step (categorization, anomaly detection, summary), and the inference bill drops to a fraction of what naive uploads cost.
Frequently asked questions
Why does Claude miss transactions when reading long PDF bank statements?
When you upload a PDF natively, Claude renders each page as image tokens plus extracted text. Long documents hit a visual token ceiling around 100 pages, and well before that the model starts compressing, summarizing, and skipping rows in dense tables. Bank statements with tiny fonts and hundreds of near-identical rows trigger this worst case, producing plausible-looking but wrong totals with no truncation warning.
How do I get accurate data extraction from a multi-page PDF with Claude?
Parse the PDF into structured JSON first, then send the JSON to Claude. Using about fifteen lines of Python with pdfplumber, open the PDF, loop over the pages, call extract_tables on each, flatten the rows into dicts with fields like date, description, debit, credit, and balance, then dump to JSON. Claude handles the JSON reliably because there are no images to compress or layouts to guess.
How much cheaper is JSON-first PDF processing versus native upload?
On a 47-page, 184-transaction bank statement, native PDF upload cost about 31 cents per query because you pay for image tokens on every rendered page. The JSON-first path cost about 9 cents — roughly a 71% reduction — because the structured payload is only about forty kilobytes. Across two thousand pages a month per accounting client, the savings compound quickly.
When should I use native PDF upload versus extracting to JSON first?
Use native PDF upload for short, visual, layout-heavy documents of three pages or fewer without tables, such as a one-page invoice with a logo, a signed letter, or a screenshot. Prose-only documents of up to ten pages are also fine. Extract to JSON first for any document that contains tables, and always extract first for anything over ten pages. This decision tree eliminates most silent accuracy bugs in document pipelines.
What types of documents benefit most from JSON extraction before LLM analysis?
Any document where value lives in structured rows or specific clauses: bank statements, multi-page contracts, supplier invoices with line-item tables, insurance policies, and lease agreements. These contain repeating layouts and dense tables that trigger compression and row-skipping during native PDF processing. Extracting structured data first, then reasoning over it, restores deterministic accuracy that native upload cannot guarantee.
Want more like this?
I publish practical AI automation, GenAI engineering, and faceless content workflows on YouTube every week.
Subscribe to bizflowai.io on YouTube — never miss a new tutorial.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.