Claude's PDF Upload Corrupts Cyrillic Silently — Here's the Fix


A Serbian invoicing client called us two weeks after launch: half the vendor names in their accounting database were wrong. Not missing — wrong. The numbers, dates, and tax IDs were perfect. The names had been silently transliterated into Latin garbage by Claude's PDF extractor, and nobody noticed because the response looked confident and well-formed.

If you process invoices, contracts, or supplier docs in Cyrillic, Arabic, Greek, or any non-Latin script, this failure mode is probably already in your data. Here's exactly how I diagnosed it and the three-step pipeline that took one client batch from 71% to 93% field accuracy — and to 99%+ end-to-end with a human review queue.

The failure mode nobody warns you about

When you send a PDF to Claude as a document (the native PDF upload path), it tries to extract the embedded text layer first. That's the fast, cheap path. The problem: a lot of PDFs from older accounting software in Serbia, Russia, Greece, and the Middle East embed Cyrillic or Arabic glyphs using custom font encodings that don't map cleanly to Unicode.

The text layer claims one thing. The visual rendering shows another. Claude's extractor falls back to a Latin approximation and returns a response that looks clean.

Real examples from the client batch:

  • Петровић доо → returned as Petroviħ doo
  • Стара Планина → returned as Stara Planina with stray combining diacritics
  • АД Електродистрибуција → mangled into a half-Latin string with no warning

No error. No confidence drop. No log line saying "I gave up on the font encoding." This is invisible unless you read the output in the source language — which is exactly why English-only tutorials never surface it. ASCII PDFs round-trip cleanly and everyone moves on.
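One way to make this failure visible without a Serbian reader on the team is a cheap script check on the extracted strings. The sketch below is an assumption layered on top of this article, not part of the pipeline described here: it uses Python's standard unicodedata module to report which scripts the letters of a string belong to. The helper names letter_scripts and looks_transliterated are hypothetical.

```python
import unicodedata

def letter_scripts(text: str) -> set[str]:
    """Return the scripts (first word of each Unicode character name) used by letters."""
    scripts = set()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, spaces, punctuation
        name = unicodedata.name(ch, "")
        if name:
            scripts.add(name.split(" ")[0])  # e.g. "CYRILLIC", "LATIN"
    return scripts

def looks_transliterated(text: str, expected_script: str = "CYRILLIC") -> bool:
    """Flag strings whose letters are not purely in the expected script."""
    scripts = letter_scripts(text)
    return bool(scripts) and scripts != {expected_script}

print(letter_scripts("Петровић доо"))   # letters are all Cyrillic
print(looks_transliterated("Petroviħ doo"))
```

This only makes sense for fields you expect to be in a single script. A vendor with a legitimately Latin name will be flagged too, so treat a hit as a reason to review, not as an error.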

The first version we shipped passed a 20-invoice spot check. Totals matched. Tax IDs matched. We pushed to prod. The bookkeeper caught it because she actually reads Serbian.

Step 1: Render PDF pages to PNG at 200 DPI

Stop trusting the embedded text layer for anything non-Latin. Rasterize the PDF and treat it as an image problem.

In Python that's pdf2image wrapping Poppler:

from pdf2image import convert_from_path

def pdf_to_images(pdf_path: str, dpi: int = 200):
    # Returns a list of PIL Images, one per page
    return convert_from_path(pdf_path, dpi=dpi, fmt="png")

Why 200 DPI specifically:

  • 150 DPI: small fonts (line items, footer tax IDs) start dropping characters. Accuracy fell ~4 points on our test set.
  • 200 DPI: clean OCR on 8pt fonts, reasonable file sizes (~250-400 KB per page).
  • 300 DPI: roughly 2x the vision token cost for no measurable accuracy gain on invoices.
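The cost side of that trade-off follows from simple geometry: pixel count grows with the square of the DPI. The sketch below assumes an A4 page and only computes rendered pixel dimensions. How the API converts pixels into billed tokens, including any downscaling of large images, is not modeled here, and the function names are hypothetical.

```python
A4_INCHES = (8.27, 11.69)  # assumed page size; US Letter would be (8.5, 11)

def page_pixels(dpi: int, size_in=A4_INCHES) -> tuple[int, int]:
    """Width and height in pixels of one rendered page at the given DPI."""
    return (round(size_in[0] * dpi), round(size_in[1] * dpi))

def pixel_ratio(dpi_a: int, dpi_b: int) -> float:
    """How many times more pixels a page has at dpi_a than at dpi_b."""
    return (dpi_a / dpi_b) ** 2

for dpi in (150, 200, 300):
    w, h = page_pixels(dpi)
    print(f"{dpi} DPI: {w} x {h} px, {pixel_ratio(dpi, 200):.2f}x the pixels of 200 DPI")
```

Going from 200 to 300 DPI multiplies the pixel count by 2.25, which is in line with the roughly 2x cost increase noted above.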

On the install side, you need Poppler available on the host. On Ubuntu/WSL:

sudo apt install poppler-utils

Step 2: Send the pages to Claude as images, not as a document

This is the actual fix. Vision reads rendered pixels. It sees the glyph a human sees. It doesn't care that the PDF's font encoding table is lying about what character is at position 0x42.

import base64, io, anthropic

def image_to_b64(pil_img) -> str:
    buf = io.BytesIO()
    pil_img.save(buf, format="PNG")
    return base64.standard_b64encode(buf.getvalue()).decode("utf-8")

def extract_invoice(pil_pages, schema_prompt: str):
    client = anthropic.Anthropic()
    content = []
    for page in pil_pages:
        content.append({
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": image_to_b64(page),
            },
        })
    content.append({"type": "text", "text": schema_prompt})

    resp = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text

Cyrillic comes back as Cyrillic. Arabic comes back as Arabic. The font-encoding bug is bypassed entirely because we never touch the embedded text layer.

Step 3: Force a JSON schema with per-field confidence

This is the step most people skip, and it's what separates a 90% demo from a production pipeline. The model is going to be wrong sometimes. You need to know which fields it's unsure about so you can route those to a human instead of pushing garbage into the accounting system.

SCHEMA_PROMPT = """
Extract invoice data from the attached PDF pages. Return ONLY valid JSON
matching this schema. Preserve original script (Cyrillic, Latin, etc.)
exactly as it appears on the page.

{
  "vendor_name":     {"value": "string", "confidence": 0.0-1.0},
  "vendor_tax_id":   {"value": "string", "confidence": 0.0-1.0},
  "invoice_number":  {"value": "string", "confidence": 0.0-1.0},
  "invoice_date":    {"value": "YYYY-MM-DD", "confidence": 0.0-1.0},
  "total_amount":    {"value": "number",  "confidence": 0.0-1.0},
  "currency":        {"value": "string", "confidence": 0.0-1.0},
  "line_items": [
    {"description": "string", "qty": "number", "unit_price": "number",
     "line_total": "number", "confidence": 0.0-1.0}
  ]
}

Confidence reflects how certain you are the field is correctly read from
the image. Low confidence = ambiguous glyphs, occlusion, or unclear layout.
"""

Then validate server-side and route:

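import json

CONFIDENCE_THRESHOLD = 0.8

def route_invoice(raw_json: str):
    data = json.loads(raw_json)
    # Header fields: each value is a {"value": ..., "confidence": ...} dict.
    # A missing confidence counts as 0.0 so the field goes to review, not the books.
    low_conf_fields = [
        k for k, v in data.items()
        if isinstance(v, dict) and v.get("confidence", 0.0) < CONFIDENCE_THRESHOLD
    ]
    # Line items are a list, so check each item's confidence separately.
    for i, item in enumerate(data.get("line_items", [])):
        if item.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
            low_conf_fields.append(f"line_items[{i}]")
    if low_conf_fields:
        return ("review_queue", data, low_conf_fields)
    return ("auto_post", data, [])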

Vendor name below 0.8? Review queue. Tax ID below 0.8? Review queue. Everything clean? Straight to the books.
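The json.loads call in route_invoice assumes the model returned bare, well-shaped JSON. In practice a response may arrive wrapped in a markdown code fence or with a field missing, so it can help to parse defensively and check the shape before routing. The following is a minimal sketch under that assumption. parse_model_json, schema_problems, and REQUIRED_FIELDS are hypothetical names, and the field list simply mirrors the schema prompt above.

```python
import json

FENCE = "`" * 3  # markdown code-fence marker
REQUIRED_FIELDS = ("vendor_name", "vendor_tax_id", "invoice_number",
                   "invoice_date", "total_amount", "currency")

def parse_model_json(raw: str) -> dict:
    """Strip an optional markdown code fence, then parse the JSON payload."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (with or without a language tag).
        text = text.split("\n", 1)[1] if "\n" in text else ""
        # Drop the closing fence.
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)

def schema_problems(data: dict) -> list[str]:
    """Return human-readable problems; an empty list means the shape is OK."""
    problems = []
    for field in REQUIRED_FIELDS:
        entry = data.get(field)
        if not isinstance(entry, dict) or "value" not in entry:
            problems.append(f"{field}: missing or malformed")
            continue
        conf = entry.get("confidence")
        if not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
            problems.append(f"{field}: confidence not in [0, 1]")
    if not isinstance(data.get("line_items"), list):
        problems.append("line_items: expected a list")
    return problems
```

An invoice with any schema problem is probably best sent to the review queue along with the raw response, since a malformed payload says nothing reliable about the fields it does contain.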

Real numbers from the 220-invoice client batch

These are not estimates. This is the actual measurement run against a labeled ground-truth set the bookkeeper hand-validated:

  • Native PDF upload to Claude: 71% field accuracy. Vendor names were the worst offender — roughly 60% wrong on the Cyrillic ones. Numbers were nearly perfect.
  • Tesseract OCR pre-pass → text into Claude: 89%. Tesseract chokes on mixed-script layouts and merges adjacent columns on dense invoices.
  • Render to PNG → Claude vision → JSON schema: 93% field accuracy on first pass.
  • Same pipeline + human review queue on confidence < 0.8: 99%+ end-to-end. About 14% of invoices hit the review queue. Bookkeeper clears the queue in roughly 20 minutes per day.
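For readers who want to run the same comparison on their own documents, field accuracy can be computed as the share of (invoice, field) pairs where the extracted value matches the hand-validated label. The sketch below assumes exact string matching after trimming whitespace. The article does not specify how fields were scored for this batch, so treat the scoring rule and the function name as assumptions.

```python
def field_accuracy(predicted: list[dict], truth: list[dict], fields: list[str]) -> float:
    """Share of (invoice, field) pairs where the extracted value equals the label."""
    correct = total = 0
    for pred, gold in zip(predicted, truth):
        for field in fields:
            total += 1
            if str(pred.get(field, "")).strip() == str(gold.get(field, "")).strip():
                correct += 1
    return correct / total if total else 0.0

truth = [
    {"vendor_name": "Петровић доо", "total_amount": "100"},
    {"vendor_name": "Стара Планина", "total_amount": "250"},
]
predicted = [
    {"vendor_name": "Petroviħ doo", "total_amount": "100"},   # transliterated name
    {"vendor_name": "Стара Планина", "total_amount": "250"},
]
print(field_accuracy(predicted, truth, ["vendor_name", "total_amount"]))
```

Exact matching is deliberately strict: a transliterated vendor name counts as wrong even when a human could guess the original.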

A few practical notes from running this in production:

  • Vision tokens cost more than text tokens. For high-volume clients, hash each page image and cache the extraction result — re-process only pages that changed.
  • If your PDFs are pure Latin script and come from modern software (Stripe, QuickBooks, Xero exports), native upload is still fine and cheaper. The render-to-image pipeline is specifically the fix for non-Latin scripts, scanned docs, and PDFs from legacy software.
  • Always log the raw model response alongside the parsed fields. When something breaks two months later, you need the original output to debug, not just the cleaned data.
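The caching note can be sketched as a small content-addressed store. This is an illustration, not the production code: PageCache is a hypothetical class, the store is an in-memory dict where a real deployment would use a database, and it assumes extraction runs per page (the extract_invoice function above sends all pages in one request). It also assumes rendering is deterministic, so the same page yields the same PNG bytes. If that does not hold, hashing the source PDF bytes is an alternative.

```python
import hashlib

class PageCache:
    """In-memory cache keyed by the SHA-256 of a page's PNG bytes."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def get_or_extract(self, png_bytes: bytes, extract_fn) -> str:
        key = hashlib.sha256(png_bytes).hexdigest()
        if key not in self._store:
            # Cache miss: pay for one extraction call and keep the raw output.
            self._store[key] = extract_fn(png_bytes)
        return self._store[key]

# Usage with a stand-in extractor that counts how often it is called.
calls = []

def fake_extract(png_bytes: bytes) -> str:
    calls.append(png_bytes)
    return f"result-{len(calls)}"

cache = PageCache()
cache.get_or_extract(b"page-1", fake_extract)
cache.get_or_extract(b"page-1", fake_extract)  # served from cache, no second call
print(len(calls))
```

Storing the raw model output as the cached value also covers the logging note: the original response is kept next to the hash of the page that produced it.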

The whole pipeline is about 80 lines of Python end-to-end. PDF to image, image to vision, structured JSON with confidence, validate, queue the uncertain ones. That's it.

Why bizflowai.io helps with this

Document ingestion for SMBs in non-English markets is one of the workflows I build for clients every week — Serbian invoicing companies, Greek logistics operators, regional accounting firms. The render-to-vision pipeline above is the default ingestion path in those builds, wired up to a review queue, an audit log of raw model output, and a webhook into whatever accounting or ERP system the client already runs. The goal is always the same: the bookkeeper trusts the numbers without reading every PDF herself.

Frequently asked questions

Why does Claude's native PDF parser corrupt Cyrillic and Arabic text?

Many PDFs from older accounting software embed non-Latin glyphs using custom font encodings that don't map cleanly to Unicode. Claude's PDF extractor reads the broken text layer and silently falls back to a Latin approximation, so Петровић доо becomes Petroviħ doo. There's no error or warning, and numbers stay correct, which makes the corruption invisible unless you read output in the source language.

How do I extract text from non-Latin PDFs using Claude accurately?

Use a three-step pipeline. First, convert each PDF page to a 200 DPI PNG using pdf2image and Poppler in Python. Second, base64-encode the images and send them to Claude as image content instead of uploading the PDF as a document, so it reads rendered pixels rather than the broken font encoding. Third, force a JSON schema with per-field confidence scores and route low-confidence results to human review.

What DPI should I use when converting PDFs to images for Claude vision?

200 DPI is the sweet spot for converting PDF pages to PNG before sending them to Claude as images. At 150 DPI, accuracy drops noticeably on small fonts. At 300 DPI, you roughly double token costs with almost no accuracy gain. 200 DPI balances OCR quality against vision token cost for typical supplier invoices and utility bills.

When should I use Claude vision instead of native PDF upload?

Use the render-to-image plus vision pipeline for non-Latin scripts like Cyrillic or Arabic, scanned documents, and PDFs generated by legacy software with broken font encodings. For pure Latin script PDFs that are well-formed, native PDF upload to Claude is fine and cheaper. Vision tokens cost more than text tokens, so reserve the heavier pipeline for documents that actually need it.

How accurate is Claude vision versus native PDF parsing for invoice extraction?

On a real batch of 220 supplier invoices with mixed Cyrillic and Latin script, native PDF upload to Claude hit 71% field accuracy, with vendor names being the worst offender. Tesseract OCR piped as text into Claude reached 89%. The PDF-to-image plus Claude vision pipeline hit 93%, and adding a human review queue for low-confidence fields pushed end-to-end accuracy above 99%.



Planning an AI automation project or need a second opinion on your architecture?

Connect with me on LinkedIn — Lazar Milicevic, GenAI Engineer & bizflowai.io Founder.

Visit bizflowai.io for our services, case studies, and AI consulting.
