Claude Misreads 42% of Cyrillic Invoices — One Python Trick

A founder forwards me a Gmail folder with 217 supplier invoices from last month. Half are Cyrillic, half are Latin, all are PDFs. The "obvious" Claude pipeline reads 58% of fields correctly — which sounds okay until you realize that means almost every invoice has at least one wrong number flowing into accounting. Here's the one preprocessing step that fixed it.
The failure mode nobody warns you about
The default workflow looks clean. Upload the PDF directly to Claude, ask for structured JSON, parse the response, write it to your accounting system. For pure English invoices generated by modern software, it works well enough that you ship it.
Then a non-English invoice hits the pipeline and you get this:
- Supplier name mangled: three Cyrillic characters (С, Р, Е) silently swapped for visually identical Latin (C, P, E). The name now matches no supplier in your database, so it gets created as a new vendor.
- Two line items merged into one: the tokenizer glued adjacent table cells. Quantity from row 2 ended up multiplied by unit price from row 3.
- VAT off by €12: a decimal comma read as a thousands separator. 1,250.00 became 1250000.
- Zero errors raised. The JSON is well-formed. The schema validates. The numbers are just wrong.
I ran this exact setup against a 100-invoice test set with mixed Cyrillic/Latin content. Field-level accuracy: 58%. Almost half the extracted fields wrong or partially wrong. That's not a tool — that's a liability sitting between Gmail and your books.
Why Claude's PDF reader silently breaks on mixed scripts
Most tutorials skip this. When you upload a PDF, Claude isn't looking at the page the way you do. It reads the embedded text layer plus some structural hints (font sizes, positions, table boundaries the PDF generator left behind).
That text layer is great when the PDF was born from modern software emitting clean Unicode in a single script. It falls apart in three common cases:
- Mixed scripts in the same cell — Cyrillic descriptions next to Latin SKU codes. The tokenizer makes confident substitutions between visually similar codepoints.
- Old accounting software — Serbian, Greek, Bulgarian SMB tools often emit PDFs with broken or partial font embedding. The text layer says one thing, the rendered glyphs show another.
- Scanned pages with an OCR layer underneath — the OCR did its best, but it hallucinated characters that look right but aren't. Claude trusts that layer.
In every case the model doesn't tell you it's guessing. It returns JSON. Confident, formatted, wrong.
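The lookalike swap is easy to detect after the fact, even though the model never reports it. Here is a minimal sketch, not part of the pipeline in this post: `scripts_in` and `mixed_script_tokens` are hypothetical helper names, and the only dependency is Python's standard `unicodedata` module. It flags any whitespace-separated token that mixes Latin and Cyrillic letters.

```python
import unicodedata


def scripts_in(token: str) -> set[str]:
    """Return the scripts of the letters in a token, e.g. {"LATIN", "CYRILLIC"}."""
    scripts = set()
    for ch in token:
        if ch.isalpha():
            # Unicode character names start with the script,
            # e.g. "CYRILLIC CAPITAL LETTER ES" or "LATIN CAPITAL LETTER C".
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def mixed_script_tokens(text: str) -> list[str]:
    """Tokens that contain both Latin and Cyrillic letters."""
    return [t for t in text.split() if {"LATIN", "CYRILLIC"} <= scripts_in(t)]


# "\u0421\u0420\u0415" is Cyrillic ES, ER, IE; "C\u0420E" swaps two of them for Latin.
print(mixed_script_tokens("\u0421\u0420\u0415 d.o.o."))  # nothing suspicious
print(mixed_script_tokens("C\u0420E d.o.o."))            # flags the first token
```

A genuine SKU can legitimately mix scripts, so I would treat a hit as a reason to look closer at a supplier name, not as proof of an error.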
The shortlist of red flags in any PDF pipeline
- Output JSON validates but supplier names don't match your vendor table
- VAT totals drift by exactly one decimal-place factor (10x, 100x, 1000x)
- Line item counts vary run-to-run on the same file
- Cyrillic/Greek/Arabic characters appear as Latin lookalikes in output
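Two of those red flags can be checked with plain arithmetic before anything reaches accounting. The sketch below is an illustration under stated assumptions: the dict follows the JSON schema used later in this post, line totals are net amounts, and the grand total equals line totals plus VAT. `check_invoice` and `power_of_ten_drift` are hypothetical names.

```python
from decimal import Decimal
from typing import Optional


def power_of_ten_drift(expected: Decimal, actual: Decimal) -> Optional[int]:
    """Return 10, 100 or 1000 if one value is exactly that multiple of the other."""
    if expected == actual:
        return None
    for factor in (10, 100, 1000):
        if expected * factor == actual or actual * factor == expected:
            return factor
    return None


def check_invoice(inv: dict, tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return a list of problems; an empty list means the arithmetic is consistent."""
    problems = []
    line_sum = Decimal("0")
    for i, item in enumerate(inv.get("line_items") or []):
        # Go through str() so floats like 10.5 become exact decimals.
        expected = Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
        total = Decimal(str(item["total"]))
        if abs(expected - total) > tolerance:
            drift = power_of_ten_drift(expected, total)
            note = f" by a factor of {drift}" if drift else ""
            problems.append(f"line {i}: quantity * unit_price differs from total{note}")
        line_sum += total

    vat, grand = inv.get("vat_amount"), inv.get("grand_total")
    if vat is not None and grand is not None:
        # Assumption: grand total = net line totals + VAT.
        if abs(line_sum + Decimal(str(vat)) - Decimal(str(grand))) > tolerance:
            problems.append("line totals + VAT differ from grand total")
    return problems
```

An invoice that passes can still be wrong, but a failure is a cheap signal that a row was merged or a separator was misread.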
The fix: render to image, send as vision input
Stop sending PDFs. Render each page to a PNG at 200 DPI and send those images to Claude as vision input with a strict JSON schema in the prompt. That's the entire fix.
```python
import base64
import json
from io import BytesIO

from anthropic import Anthropic
from pdf2image import convert_from_path

client = Anthropic()  # reads ANTHROPIC_API_KEY from the environment

SCHEMA_PROMPT = """Extract the following fields from this invoice page.
Return strictly valid JSON matching this schema:
{
  "supplier_name": "string",
  "invoice_number": "string",
  "issue_date": "YYYY-MM-DD or null",
  "line_items": [
    {"description": "string", "quantity": "number",
     "unit_price": "number", "total": "number"}
  ],
  "vat_amount": "number or null",
  "grand_total": "number or null"
}
If a field is not present on this page, return null.
Do not guess. Do not invent values. Preserve original script
(Cyrillic stays Cyrillic, Latin stays Latin).
Return ONLY the JSON object, no prose."""


def extract_invoice(pdf_path: str) -> list[dict]:
    """Render each PDF page to PNG and extract one JSON object per page."""
    pages = convert_from_path(pdf_path, dpi=200)  # list of PIL images
    results = []
    for page in pages:
        # Encode the rendered page as a base64 PNG for the vision input.
        buf = BytesIO()
        page.save(buf, format="PNG")
        img_b64 = base64.standard_b64encode(buf.getvalue()).decode()

        # One request per page: the image first, then the schema prompt.
        msg = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=2048,
            messages=[{
                "role": "user",
                "content": [
                    {"type": "image", "source": {
                        "type": "base64",
                        "media_type": "image/png",
                        "data": img_b64,
                    }},
                    {"type": "text", "text": SCHEMA_PROMPT},
                ],
            }],
        )
        # json.loads raises json.JSONDecodeError if the reply is not bare JSON.
        results.append(json.loads(msg.content[0].text))
    return results
```
Roughly two dozen lines of logic plus a prompt. Two dependencies: pdf2image, which wraps poppler (apt install poppler-utils on Ubuntu, brew install poppler on Mac), and the official anthropic SDK.
The reason this works: the vision model treats the page as a page. It sees the table as a table. It sees Cyrillic glyphs as Cyrillic because they look like Cyrillic; there's no broken text layer in the middle lying about what's there. And because you're pinning a JSON schema in the prompt, you get output you can hand to json.loads on nearly every call, not prose you have to regex apart.
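A schema in the prompt is a strong nudge, not a guarantee, so it is worth parsing the reply defensively. The sketch below is an assumption on my part, not something the pipeline above requires: `parse_model_json` is a hypothetical helper that tolerates a Markdown code fence around the JSON and checks that the top-level keys from the schema are present.

```python
import json

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the example readable

REQUIRED_KEYS = {"supplier_name", "invoice_number", "issue_date",
                 "line_items", "vat_amount", "grand_total"}


def parse_model_json(raw: str) -> dict:
    """Parse a model reply into a dict, or raise ValueError if it is unusable."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag)
        # and everything from the closing fence onward.
        text = text.split("\n", 1)[1] if "\n" in text else ""
        text = text.rsplit(FENCE, 1)[0]
    data = json.loads(text)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data
```

Swapping this in for the bare `json.loads` call means a malformed page raises one exception type that the caller can catch and send to review.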
Why 200 DPI, why per-page, and the cost math
Three knobs matter, and I tuned them on actual production traffic.
DPI = 200. I tested 150, 200, 300, and 400.
| DPI | Field accuracy | API cost per page (USD) | Notes |
|---|---|---|---|
| 150 | 81% | $0.022 | Small digits in VAT column get fuzzy |
| 200 | 94% | $0.041 | Sweet spot — all glyphs sharp |
| 300 | 94% | $0.078 | No accuracy gain, ~2x cost |
| 400 | 93% | $0.121 | Slightly worse, image gets compressed |
200 DPI keeps decimal points and thin Cyrillic strokes sharp without burning tokens on noise.
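To see what each DPI setting means in pixels, the arithmetic is simple. This sketch assumes A4 pages (210 × 297 mm); `page_pixels` is a hypothetical helper, and nothing here measures accuracy or cost.

```python
A4_MM = (210, 297)  # assumption: invoices are A4


def page_pixels(dpi: int, size_mm: tuple[int, int] = A4_MM) -> tuple[int, int]:
    """Pixel width and height of a page rendered at the given DPI (25.4 mm per inch)."""
    return tuple(round(mm / 25.4 * dpi) for mm in size_mm)


for dpi in (150, 200, 300, 400):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h / 1e6:.1f} megapixels")
```

A 400 DPI page has about four times the pixels of a 200 DPI page. Vision APIs commonly downscale large images before the model sees them, which may be what the 400 DPI row is picking up, so it is worth checking the current image size limits in Anthropic's vision documentation before rendering higher.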
One call per page, not one per PDF. A 4-page PDF as a single multi-image request occasionally bleeds context between pages — the model carries assumptions from page 1 into page 4. Per-page calls are independent, parallelizable with asyncio, and easier to retry on failure.
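Here is one way the per-page fan-out could look with asyncio. It is a sketch under assumptions: `extract_one` is a hypothetical coroutine you supply (for example, one built on the SDK's `AsyncAnthropic` client), and the concurrency limit, attempt count, and backoff values are placeholders, not tuned numbers.

```python
import asyncio
from typing import Awaitable, Callable


async def extract_pages(
    pages: list,
    extract_one: Callable[[object], Awaitable[dict]],
    max_concurrency: int = 4,
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> list:
    """Run extract_one on every page concurrently; a page that keeps failing yields None."""
    sem = asyncio.Semaphore(max_concurrency)

    async def worker(page):
        async with sem:  # cap the number of requests in flight
            for attempt in range(max_attempts):
                try:
                    return await extract_one(page)
                except Exception:
                    if attempt + 1 < max_attempts:
                        # Exponential backoff, then retry only this page.
                        await asyncio.sleep(base_delay * 2 ** attempt)
            return None  # caller routes this page to human review

    # gather preserves input order, so results line up with page numbers.
    return await asyncio.gather(*(worker(p) for p in pages))
```

Because each page is retried on its own, one bad page never forces a re-run of the whole PDF.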
Force null over guessing. The single most important sentence in the prompt is "If a field is not present on this page, return null. Do not guess." Without it, Claude will fabricate plausible invoice numbers when the field is occluded. With it, your downstream code gets an explicit null and routes that document to human review.
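Since the extractor returns one dict per page, something downstream has to combine them and decide what counts as "needs review". A minimal sketch, with hypothetical names `merge_pages` and `null_fields`. The merge policy (first non-null value wins, line items concatenated in page order) is an assumption you may want to change, for instance if your suppliers print a running subtotal on every page.

```python
SCALAR_FIELDS = ("supplier_name", "invoice_number", "issue_date",
                 "vat_amount", "grand_total")


def merge_pages(pages: list[dict]) -> dict:
    """Combine per-page results into a single invoice dict."""
    merged = {field: None for field in SCALAR_FIELDS}
    merged["line_items"] = []
    for page in pages:
        for field in SCALAR_FIELDS:
            # Keep the first non-null value seen for each scalar field.
            if merged[field] is None and page.get(field) is not None:
                merged[field] = page[field]
        merged["line_items"].extend(page.get("line_items") or [])
    return merged


def null_fields(invoice: dict) -> list[str]:
    """Fields still null after merging; a non-empty list means human review."""
    missing = [f for f in SCALAR_FIELDS if invoice.get(f) is None]
    if not invoice.get("line_items"):
        missing.append("line_items")
    return missing
```

Which fields are allowed to stay null is a policy choice; here every schema field counts, including the issue date.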
Real numbers from a production pipeline
Same 100-invoice test set, before and after the switch. Mixed Cyrillic/Latin, mix of digital and scanned PDFs, generated by four different accounting tools.
| Metric | Raw PDF upload | Per-page PNG + vision |
|---|---|---|
| Field-level accuracy | 58% | 94% |
| Cost per invoice | ~$0.31 | ~$0.04 |
| Avg processing time | 14.8 s | 6.1 s |
| Silent wrong values | ~40% of invoices | <1% |
| Explicit nulls (flagged for review) | 0% | 5% |
The cost drop surprises people. Vision input on a 200 DPI page image is dramatically cheaper than feeding Claude a multi-page PDF it has to internally parse, tokenize, and reason about. You're handing it a clean visual artifact instead of a structured document with a poisoned text layer.
The failure mode change matters more than the accuracy number. Before: confidently wrong values flowing into accounting. After: explicit null on the 5% of pages that are genuinely unreadable (water-damaged scans, fax-quality reproductions), and those get auto-routed to a human review queue. That's the difference between a demo and a system you can run a business on.
Where I still see failures in the 6%
- Handwritten amount overrides on printed forms
- Receipts photographed at extreme angles (fix: a deskew step before render)
- Two-column invoices where line items wrap across the fold
- Carbon-copy duplicates where the rear copy bled through
This pattern generalizes beyond Cyrillic
After shipping this for Serbian clients, I started using the same preprocessing in front of every PDF extraction job, regardless of language. Same accuracy jump on:
- Greek pharmacy receipts — mixed Greek/Latin SKUs
- Arabic contracts — right-to-left text layers that Claude's PDF reader rearranges badly
- Chinese-English shipping documents — CJK characters in a Latin-dominant template
- Messy English invoices from 15-year-old accounting software — where the text layer is technically Latin but corrupted by font embedding bugs
The rule I now ship with: if the PDF text layer is unreliable for any reason, render to image and use vision. It's not a language-specific fix. It's a fix for the gap between what a PDF claims to contain and what's actually printed on the page.
The few places I still send raw PDFs: clean digital invoices from large modern SaaS vendors (Stripe, AWS, Google Workspace), where the text layer is bulletproof and per-page rendering is just wasted compute.
Why bizflowai.io helps with this
This exact preprocessing layer sits in front of every invoice extraction pipeline I build at bizflowai.io — Gmail watcher pulls PDFs as they arrive, pdf2image renders pages at 200 DPI, Claude vision returns schema-validated JSON, low-confidence pages get auto-routed to a Telegram review queue, and clean extractions write straight into the client's accounting system (Fakturko, QuickBooks, or a Google Sheet for the smaller setups). Most clients are processing 100–300 supplier invoices a month with one human reviewing the 5% of flagged edge cases instead of retyping all 300.
Frequently asked questions
Why does Claude extract non-English invoices incorrectly from PDFs?
Claude's PDF reader doesn't see the page visually — it extracts the embedded text layer plus structural hints. When invoices contain mixed Cyrillic and Latin scripts, were generated by older accounting software, or have scanned OCR layers, the tokenizer makes confident guesses: substituting similar-looking characters, merging table columns, and misreading decimal separators. It never warns you, just returns beautifully formatted wrong numbers.
How do I extract structured data from multilingual PDF invoices using Claude?
Stop sending raw PDFs. Use pdf2image (which wraps poppler) to render each page as a PNG at 200 DPI, then send each image to Claude as vision input with a structured extraction prompt. Include the JSON schema directly in the prompt — fields like supplier name, invoice number, ISO date, line items array, VAT, and grand total — and instruct Claude to return null for missing fields rather than guessing.
What accuracy improvement does image-based extraction give over PDF upload?
On a 100-invoice test set with mixed Cyrillic and Latin scripts, raw PDF upload to Claude achieved only 58% field-level accuracy. Switching to per-page PNG rendering at 200 DPI with a structured JSON schema prompt raised accuracy to 94%. Cost dropped from roughly 31 cents per invoice to about 4 cents, and average processing time fell from under 15 seconds to about 6 seconds per document.
Why is 200 DPI the right resolution for invoice page rendering?
200 DPI is the sweet spot for converting PDF pages to PNG images for Claude's vision model. It's high enough that small digits — critical for amounts, VAT, and invoice numbers — stay sharp and legible. It's low enough that you don't burn extra tokens processing visual noise, keeping per-page vision costs dramatically lower than feeding a multi-page PDF the model has to chew through.
When should I use Claude's vision input instead of PDF upload?
Use vision input with rendered page images whenever documents contain non-Latin scripts, mixed scripts in the same cell, scanned pages with OCR layers, or PDFs from older software with unclean embedded fonts. PDF upload works fine for clean, English-only, software-generated invoices. For anything multilingual or structurally messy, image-based extraction avoids the broken text-layer problem entirely because Claude sees the page as a page.
Planning an AI automation project or need a second opinion on your architecture?
Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.
Visit bizflowai.io for our services, case studies, and AI consulting.