Structured Extraction From Messy PDFs: A Guide

You finally got the client to send invoices instead of photos of invoices. You wired up an LLM, asked for JSON, and shipped it. Two weeks later, accounting flags that the totals are off by a cent on 4% of documents, the dates on European invoices are sliding by months, and one tax ID extracted as "see attached." Welcome to the actual job of PDF extraction.
This guide is field-by-field. Totals, dates, line items, tax IDs, and addresses each fail differently, and they need different prompts, different validators, and different fallback rules. The point is not to get to "looks correct." The point is to get to "wrong loudly, right silently."
Why "just throw it at GPT" stops working at scale
A single-shot prompt that returns a JSON blob works fine on the ten PDFs you tested. It breaks on the eleventh because:
- Scanned pages have OCR noise (
Ovs0,lvs1, merged columns). - Two-column layouts get read in the wrong order, so the model picks a number from a different invoice block.
- Currency symbols and thousands separators are locale-dependent.
- Line items wrap across pages, and the model silently drops the continuation.
- The model confidently hallucinates a tax ID in the right format when the field is genuinely missing.
The fix is not a bigger model. The fix is treating each field as its own small engineering problem with a prompt, a parser, a validator, and a fallback. The orchestration on top is boring on purpose.
A useful mental model:
PDF → text + layout → per-field extractor → per-field validator
↓
pass → write to record
fail → human queue with context
Everything below fills in that diagram.
Step 0: Get clean text before you prompt anything
Most extraction failures are upstream. If your text layer is garbage, no prompt will save you.
For born-digital PDFs, use a layout-aware extractor (pdfplumber, pymupdf, or a vendor service that returns blocks with bounding boxes). For scans, run OCR with a modern engine and keep the layout. Do not flatten everything to a single string until you have to.
import pymupdf # fitz
def extract_blocks(path: str) -> list[dict]:
doc = pymupdf.open(path)
blocks = []
for page_num, page in enumerate(doc):
for b in page.get_text("blocks"):
x0, y0, x1, y1, text, block_no, block_type = b
blocks.append({
"page": page_num,
"bbox": [x0, y0, x1, y1],
"text": text.strip(),
})
return blocks
Why blocks instead of one big string: when the LLM picks the wrong "Total" on a page with three totals (subtotal, tax, grand total), you want to be able to constrain the prompt to specific regions, or at minimum show the model the labels and amounts as structured pairs.
A quick sanity check: print the extracted text for ten messy PDFs and read it. If you cannot find the total with your eyes in the raw text, the model will not either.
Field 1: Totals (the deceptively easy one)
Totals fail in three ways: wrong number picked (subtotal instead of grand total), correct number with wrong currency, and correct number rounded incorrectly because of locale parsing.
Prompt the model for the candidates, not the answer. Then validate arithmetically.
EXTRACT_TOTALS_PROMPT = """
From the invoice text below, list every monetary amount that appears next to
a label. Return JSON with: label (verbatim), amount (string, as printed),
currency (ISO 4217 if determinable, else null).
Do not normalize the amount. Do not infer values that are not printed.
Text:
{text}
"""
Then in code:
from decimal import Decimal, InvalidOperation
def parse_amount(raw: str, locale_hint: str = "en") -> Decimal | None:
s = raw.replace(" ", "").replace("\u00a0", "")
# European: 1.234,56 → 1234.56
if locale_hint in ("de", "fr", "it", "es", "nl") and "," in s:
s = s.replace(".", "").replace(",", ".")
else:
s = s.replace(",", "")
try:
return Decimal(s)
except InvalidOperation:
return None
def pick_grand_total(candidates: list[dict]) -> dict | None:
priority = ["grand total", "total due", "amount due", "total", "balance due"]
for label in priority:
for c in candidates:
if label in c["label"].lower():
return c
return None
Then validate: subtotal + tax == total within one-cent tolerance. If it does not reconcile, do not write the record. Flag it.
| Failure mode | Detector | Action |
|---|---|---|
| Picked subtotal as total | Arithmetic check fails | Re-extract with stricter prompt |
| Locale-mis-parsed amount | Sanity range vs. line items | Retry with locale hint from country |
OCR 0/O confusion |
Decimal parse fails |
Send to human queue |
| Missing currency | currency is None |
Infer from issuer country, flag low confidence |
Rule of thumb I use: any total under $1 or over $10M for an SMB invoice gets a soft flag regardless of validator. Outliers are almost always parse errors.
Field 2: Dates (where locales bite)
02/03/2024 is March 2 in the US and February 3 nearly everywhere else. The model will guess based on training data bias, which means it will be wrong on a quietly meaningful fraction of your European documents.
Never let the LLM decide the date format. Extract the raw string, then parse with explicit rules.
EXTRACT_DATES_PROMPT = """
List every date that appears in the document with its label.
Return JSON: [{ "label": "...", "raw": "...", "page": N }].
Do NOT convert the date. Return it exactly as printed.
"""
from datetime import date
import re
def parse_date(raw: str, country: str | None) -> date | None:
# ISO is unambiguous
m = re.match(r"(\d{4})-(\d{2})-(\d{2})", raw)
if m:
return date(int(m[1]), int(m[2]), int(m[3]))
# Textual month is unambiguous
# "3 Feb 2024" / "Feb 3, 2024" — use dateutil with explicit dayfirst
from dateutil import parser
eu_countries = {"DE", "FR", "IT", "ES", "NL", "BE", "PL", "PT", "AT"}
dayfirst = country in eu_countries if country else False
try:
return parser.parse(raw, dayfirst=dayfirst).date()
except (ValueError, parser.ParserError):
return None
Validation rules that catch real bugs:
invoice_date <= due_date(or no due date present)invoice_dateis within a sane window: not before the company existed, not more than 60 days in the future- If multiple "invoice date" candidates exist, they must agree
A subtle one: if you see 01/02/03, refuse to parse. Send to human. The two-digit year ambiguity is not worth the bug.
Field 3: Line items (where models hallucinate quietly)
Line items are where LLMs lie most plausibly. They will invent a fourth row that does not exist, or merge two rows into one, and the JSON will look perfectly clean.
Two techniques that move the needle:
1. Extract the table structure first, then the rows. Use a layout-aware tool (pdfplumber.extract_tables(), or a vendor with explicit table detection). Pass the structured table to the LLM, not the raw text. The model's job becomes labeling columns, not finding rows.
2. Validate the sum. This is the cheapest, most powerful check in the entire pipeline:
def validate_line_items(items: list[dict], stated_subtotal: Decimal) -> bool:
computed = sum(
Decimal(str(i["quantity"])) * Decimal(str(i["unit_price"]))
for i in items
)
return abs(computed - stated_subtotal) <= Decimal("0.02")
If line items do not sum to the subtotal, one of three things happened: a row is missing, a row is duplicated, or a unit price is wrong. All three are unsafe to write. All three go to a human.
A schema I use:
{
"line_items": [
{
"description": "string",
"quantity": "number",
"unit_price": "number",
"line_total": "number",
"source_page": "integer",
"source_bbox": [0, 0, 0, 0]
}
]
}
The source_page and source_bbox fields are not optional. When a human reviewer opens the failed document, they need to land on the exact spot. Round-trip review without bbox is a slog and your reviewers will start rubber-stamping.
Field 4: Tax IDs (where format is your friend)
VAT numbers, EINs, ABNs — these have strict formats. Use them.
| Country | Format | Validatable? |
|---|---|---|
| US EIN | NN-NNNNNNN |
Format only |
| UK VAT | GB + 9 or 12 digits |
Format + mod-97 checksum |
| EU VAT | Country code + national format | VIES API (real-time) |
| Australia ABN | 11 digits | Weighted checksum |
| Canada BN | 9 digits + program ID | Format only |
Never trust an LLM to validate a checksum. The model will tell you GB123456789 is valid because it looks like a UK VAT number. It is not — the checksum does not pass.
def validate_uk_vat(vat: str) -> bool:
s = vat.replace(" ", "").upper().removeprefix("GB")
if not s.isdigit() or len(s) not in (9, 12):
return False
digits = [int(c) for c in s[:7]]
weights = [8, 7, 6, 5, 4, 3, 2]
total = sum(d * w for d, w in zip(digits, weights))
check = int(s[7:9])
return (total + check) % 97 == 0 or (total + check + 55) % 97 == 0
For EU VAT numbers on B2B invoices that matter for tax, hit VIES. It is free, slow, and occasionally down — cache aggressively and treat outages as "unknown" rather than "valid."
If the tax ID does not pass format and checksum, the field is empty. Do not write a guess. An empty tax ID is recoverable; a wrong one creates a downstream tax filing problem.
Field 5: Addresses and party names
Addresses are the field where I most often see teams accept noisy data. "Close enough" addresses corrupt your customer database fast.
The pragmatic approach: extract the address as a single string, then run it through a real address validation API (Google, Smarty, Loqate) to normalize. The LLM is good at "find the billing address block" and bad at "is this a real street."
For party names (vendor name, customer name), the failure mode is variation: "Acme Corp", "Acme Corporation", "ACME CORP." all refer to the same entity. Do not let extraction decide. Pass through a deduplication step that matches against your existing customer table using normalized strings plus tax ID. The tax ID is the join key; the name is the human label.
The validation-and-fallback contract
Here is the rule I write at the top of every extraction project, and the reason most pipelines drift into chaos when it is missing:
Every field has a validator. Every field has a confidence. Every field below threshold goes to a human with the source bounding box.
A small schema enforces it:
fields:
invoice_number:
extractor: regex_then_llm
validators: [non_empty, max_length_64]
on_fail: human_review
invoice_date:
extractor: llm
validators: [parseable_date, within_sane_window, before_or_equal_due_date]
on_fail: human_review
total_amount:
extractor: llm
validators: [parseable_decimal, equals_subtotal_plus_tax, range_check]
on_fail: human_review
vat_number:
extractor: regex_then_llm
validators: [format_check, checksum_check, vies_lookup_when_eu]
on_fail: leave_empty_and_flag
line_items:
extractor: table_first_then_llm
validators: [sums_to_subtotal, all_rows_have_quantity_and_price]
on_fail: human_review
Two things to notice. First, on_fail differs by field — a missing VAT number is recoverable, a wrong line item is not. Second, the validators are mostly deterministic code, not LLM calls. The LLM extracts; code judges.
What "human review" means in practice: a queue item with the PDF, the failed field, the suggested value, and the bounding box. A reviewer should be able to confirm or correct in under 20 seconds per document. If your review UI is slower than that, your fallback is broken and people will start approving without looking.
A practical confidence model
You do not need a probabilistic ML model for this. A weighted rule-based score works well and is debuggable:
def field_confidence(field: str, value, validators_passed: list[bool]) -> float:
if not validators_passed:
return 0.0
base = sum(validators_passed) / len(validators_passed)
# Penalize fields known to be hard
hard_fields = {"line_items", "address"}
if field in hard_fields:
base *= 0.9
return round(base, 2)
Set the threshold per field per client. Finance teams want stricter thresholds than marketing ops. Make the threshold a config value, not a magic number in code, so you can tune it without a deploy.
How BizFlowAI approaches this
We build extraction pipelines where every field has a named validator and an explicit fallback path, because that is the only way to make the system safe enough to leave running unattended. The LLM is one component, not the whole pipeline — text extraction, layout parsing, regex, format checks, checksums, and arithmetic reconciliation all run before any value gets written to your record system. When a field cannot be trusted, it goes to a short human review queue with the PDF cropped to the relevant region. Volumes we see in production land in the 85–97% straight-through range depending on document quality; the rest gets reviewed in seconds rather than minutes.
If you want to see how this would work on your actual documents, bring three of your worst PDFs to a discovery call — the scanned one, the multi-page one with awkward line items, and the one in a language nobody on your team reads. We will walk through field-by-field where it would pass, where it would flag, and what the review workflow would look like.
Closing checklist
Before you ship a PDF extraction pipeline, confirm:
- You can read the raw extracted text and find every field with your eyes.
- Every field has a validator written in code, not in a prompt.
- Totals reconcile arithmetically before any record is written.
- Dates are parsed with an explicit locale, never inferred by the model.
- Tax IDs are checksum-validated, not pattern-matched.
- Line items must sum to the subtotal, no exceptions.
- Every failure routes to a human queue with the source bounding box.
- Confidence thresholds are config, not constants.
- You measured straight-through rate per client per month and it is trending up, not down.
Extraction is not glamorous. It is plumbing. But it is plumbing that, when done right, removes one of the most reliably tedious tasks from a small team's week — and when done wrong, quietly poisons the database it feeds. Build it field by field, validate everything, and fall back to a human without shame.
Frequently asked questions
How do I extract structured data from messy PDF invoices reliably?
Treat each field as its own engineering problem with a dedicated prompt, parser, validator, and fallback rather than asking an LLM for one big JSON blob. Start by getting clean layout-aware text using tools like pdfplumber or pymupdf, then extract candidates per field and validate them arithmetically or against known formats. Write records only when validation passes; route failures to a human review queue with page and bounding box context. This 'wrong loudly, right silently' approach scales far better than single-shot prompting.
Why does GPT extract the wrong total from invoices?
LLMs often pick the subtotal, tax, or another labeled amount instead of the grand total because invoices contain multiple monetary values and the model has no arithmetic grounding. They also mis-parse locale-specific formats like European '1.234,56' as 1.234 instead of 1234.56. The fix is to prompt for all labeled amounts as candidates, then programmatically pick the grand total by label priority and verify that subtotal plus tax equals total within a one-cent tolerance.
How should I handle ambiguous date formats like 02/03/2024 in PDF extraction?
Never let the LLM decide the date format. Have it return the raw printed string with its label, then parse it in code using explicit rules: ISO dates and dates with textual months are unambiguous, while numeric dates should be parsed with dayfirst=True for European countries. Validate that invoice_date precedes due_date and falls within a sane window. For two-digit year formats like '01/02/03', refuse to parse and send to human review.
How do I prevent LLMs from hallucinating line items in invoices?
Extract the table structure first using a layout-aware tool like pdfplumber.extract_tables(), then pass the structured table to the LLM so its job is labeling columns rather than finding rows. Validate that the sum of quantity times unit price across all line items matches the stated subtotal within a small tolerance. If the sums don't reconcile, a row is missing, duplicated, or priced wrong — all unsafe to write automatically. Always include source_page and source_bbox so human reviewers can jump to the exact location.
What's the best way to validate VAT numbers and tax IDs extracted from documents?
Use the strict formats these identifiers have rather than trusting the LLM. UK VAT numbers have a mod-97 checksum, Australian ABNs use a weighted checksum, and EU VAT numbers can be verified in real time against the VIES API. US EINs and Canadian BNs only support format-level validation. An LLM will confidently say a fake number like GB123456789 looks valid, so always run a deterministic checksum or API check in code.
Work with BizFlowAI
If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.
Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.
More guides like this on the BizFlowAI blog.
Frequently asked questions
How do I extract structured data from messy PDF invoices reliably?
Treat each field as its own engineering problem with a dedicated prompt, parser, validator, and fallback rather than asking an LLM for one big JSON blob. Start by getting clean layout-aware text using tools like pdfplumber or pymupdf, then extract candidates per field and validate them arithmetically or against known formats. Write records only when validation passes; route failures to a human review queue with page and bounding box context. This 'wrong loudly, right silently' approach scales far better than single-shot prompting.
Why does GPT extract the wrong total from invoices?
LLMs often pick the subtotal, tax, or another labeled amount instead of the grand total because invoices contain multiple monetary values and the model has no arithmetic grounding. They also mis-parse locale-specific formats like European '1.234,56' as 1.234 instead of 1234.56. The fix is to prompt for all labeled amounts as candidates, then programmatically pick the grand total by label priority and verify that subtotal plus tax equals total within a one-cent tolerance.
How should I handle ambiguous date formats like 02/03/2024 in PDF extraction?
Never let the LLM decide the date format. Have it return the raw printed string with its label, then parse it in code using explicit rules: ISO dates and dates with textual months are unambiguous, while numeric dates should be parsed with dayfirst=True for European countries. Validate that invoice_date precedes due_date and falls within a sane window. For two-digit year formats like '01/02/03', refuse to parse and send to human review.
How do I prevent LLMs from hallucinating line items in invoices?
Extract the table structure first using a layout-aware tool like pdfplumber.extract_tables(), then pass the structured table to the LLM so its job is labeling columns rather than finding rows. Validate that the sum of quantity times unit price across all line items matches the stated subtotal within a small tolerance. If the sums don't reconcile, a row is missing, duplicated, or priced wrong — all unsafe to write automatically. Always include source_page and source_bbox so human reviewers can jump to the exact location.
What's the best way to validate VAT numbers and tax IDs extracted from documents?
Use the strict formats these identifiers have rather than trusting the LLM. UK VAT numbers have a mod-97 checksum, Australian ABNs use a weighted checksum, and EU VAT numbers can be verified in real time against the VIES API. US EINs and Canadian BNs only support format-level validation. An LLM will confidently say a fake number like GB123456789 looks valid, so always run a deterministic checksum or API check in code.