Document Processing Pipelines: OCR Is the Easy Part

You've got a folder of 4,000 supplier invoices and a CFO asking why three of them got posted to the wrong GL account last month. The OCR worked. The numbers came out. Somewhere between "text extracted" and "row in the ERP," the pipeline lied to you with high confidence and nobody caught it.
This is the actual problem with document automation in 2025. Reading the document is a solved commodity — Textract, Document AI, Azure Document Intelligence, and the new vision-LLMs all do an acceptable job on most layouts. The hard part is everything downstream: deciding when to trust the output, when to escalate, how to validate against your own business rules, and how to keep a human in the loop without destroying the throughput gains that justified the project in the first place.
This post walks through how to build a document pipeline you'd actually let touch your ERP.
The stages nobody talks about
A real document pipeline has six stages. Most tutorials cover stage two and stop.
| Stage | What happens | Where it breaks |
|---|---|---|
| 1. Ingest | Email, SFTP, scan, portal upload | Duplicates, wrong file types, password-protected PDFs |
| 2. Extract | OCR + structure detection | Multi-page invoices, rotated scans, handwriting |
| 3. Normalize | Dates, currencies, vendor names, units | "01/04/24": January 4 or April 1? 1924 or 2024? |
| 4. Validate | Business rules, cross-field checks, lookups | Line items don't sum to total. VAT rate doesn't match country. |
| 5. Route | Auto-post, human review, or reject | Threshold logic that's too loose or too tight |
| 6. Reconcile | Posting result, audit trail, feedback | No loop back into the model when humans correct it |
Skip any of these and you end up with a pipeline that works in the demo and fails in production the day a vendor changes their invoice template.
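The "01/04/24" problem from the Normalize row can be made concrete. The sketch below assumes you keep a per-vendor hint about day/month order; the `day_first` flag and both function names are illustrative, not part of any library. It parses with an explicit order and flags strings that are valid under both readings so they can go to review instead of being guessed.

```python
from datetime import date, datetime


def normalize_date(raw: str, day_first: bool) -> date:
    """Parse a slash date with a two-digit year using an explicit day/month order."""
    fmt = "%d/%m/%y" if day_first else "%m/%d/%y"
    return datetime.strptime(raw, fmt).date()


def is_ambiguous(raw: str) -> bool:
    """True when both orders parse and yield different calendar dates."""
    readings = set()
    for fmt in ("%d/%m/%y", "%m/%d/%y"):
        try:
            readings.add(datetime.strptime(raw, fmt).date())
        except ValueError:
            pass  # this order is impossible for this string
    return len(readings) > 1


print(normalize_date("01/04/24", day_first=True))   # 2024-04-01
print(normalize_date("01/04/24", day_first=False))  # 2024-01-04
print(is_ambiguous("25/04/24"))                     # False: 25 cannot be a month
```

Without a vendor hint, an ambiguous date is a reasonable soft signal for human review.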
Pick an extractor based on your document shape, not on hype
For structured forms with consistent layouts (W-2s, standard customs forms), a layout-aware model like Azure Document Intelligence's prebuilt models or Textract Queries will outperform a general LLM. They're cheaper, faster, and return bounding boxes you can use for human review.
For semi-structured documents that vary across hundreds of vendors (invoices, delivery notes, purchase orders), a vision-capable LLM (GPT-4o, Claude, Gemini) extracting against a JSON schema is now the pragmatic default. You write the schema once, you don't maintain templates per vendor, and confidence-style scoring can be approximated with a second-pass verification.
For unstructured documents (contracts, NDAs, terms & conditions), you're doing extraction + classification + sometimes summarization. This is LLM territory end-to-end, but you'll want clause-level evidence with source spans, not just answers.
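One way to enforce "source spans, not just answers" is to check every span the model cites against the source text before accepting it. This is a minimal sketch that assumes the model returns a quote plus character offsets; the `Evidence` shape and `is_grounded` name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    quote: str  # text the model claims to be quoting
    start: int  # character offset into the source text
    end: int    # exclusive end offset


def is_grounded(source_text: str, ev: Evidence) -> bool:
    """True only if the offsets are in range and the span equals the quote."""
    if not (0 <= ev.start < ev.end <= len(source_text)):
        return False
    return source_text[ev.start:ev.end] == ev.quote


contract = "This Agreement terminates on 31 December 2026."
quote = "terminates on 31 December 2026"
start = contract.index(quote)

print(is_grounded(contract, Evidence(quote, start, start + len(quote))))  # True
print(is_grounded(contract, Evidence("terminates on 31 December 2027",
                                     start, start + len(quote))))         # False
```

An exact match is deliberately strict. If your OCR text has whitespace noise you may need to normalize both sides first, which is a design choice this sketch leaves open.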
A rough decision tree:
document_type:
  structured_form:
    extractor: layout_model  # Textract / Document AI / Azure DI
    confidence: native_scores
  semi_structured_business_doc:  # invoices, delivery notes
    extractor: vision_llm + schema
    confidence: verifier_pass + rule_checks
  unstructured_legal:
    extractor: llm_with_citations
    confidence: span_grounding + human_review_threshold
Don't ask "which extractor is best." Ask "which extractor matches the document I'm processing right now."
Schemas are the contract — write them first
Before you call any model, write the JSON schema for the data you want. This forces you to answer business questions that everyone wants to skip: Do you store the vendor's tax ID as a string or a normalized object? Is due_date mandatory? What happens if there's no PO number?
{
  "type": "object",
  "required": ["vendor", "invoice_number", "issue_date", "total", "currency", "line_items"],
  "properties": {
    "vendor": {
      "type": "object",
      "required": ["name"],
      "properties": {
        "name": {"type": "string"},
        "tax_id": {"type": ["string", "null"]},
        "iban": {"type": ["string", "null"], "pattern": "^[A-Z]{2}[0-9]{2}[A-Z0-9]+$"}
      }
    },
    "invoice_number": {"type": "string"},
    "issue_date": {"type": "string", "format": "date"},
    "due_date": {"type": ["string", "null"], "format": "date"},
    "currency": {"type": "string", "minLength": 3, "maxLength": 3},
    "subtotal": {"type": "number"},
    "tax_total": {"type": "number"},
    "total": {"type": "number"},
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "quantity", "unit_price", "line_total"],
        "properties": {
          "description": {"type": "string"},
          "quantity": {"type": "number"},
          "unit_price": {"type": "number"},
          "line_total": {"type": "number"},
          "tax_rate": {"type": ["number", "null"]}
        }
      }
    }
  }
}
Use this schema in two places: as the structured-output constraint when calling the model, and as the hard gate before anything moves downstream. If the model returns something that doesn't validate, that's a pipeline event — log it, route to review, don't try to "fix" it silently.
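The gate behaviour can be sketched with the standard library alone. The checker below is a deliberately simplified stand-in for a real JSON Schema validator: it covers only the required top-level keys and two type checks. The names `gate` and `schema_problems`, the lane strings, and the logger name are all assumptions. The point is the control flow: a failed check is logged and routed, and the payload is never repaired.

```python
import json
import logging

logger = logging.getLogger("pipeline")

# Copied from the schema's top-level "required" list.
REQUIRED = ["vendor", "invoice_number", "issue_date", "total", "currency", "line_items"]


def schema_problems(doc: dict) -> list[str]:
    """Simplified stand-in for a JSON Schema validator (required keys, two types)."""
    problems = [f"missing required field: {key}" for key in REQUIRED if key not in doc]
    total = doc.get("total")
    if "total" in doc and (isinstance(total, bool) or not isinstance(total, (int, float))):
        problems.append("total is not a number")
    if "line_items" in doc and not isinstance(doc["line_items"], list):
        problems.append("line_items is not an array")
    return problems


def gate(raw_model_output: str) -> tuple[str, list[str]]:
    """Return ("continue", []) or ("review", problems). Never edits the payload."""
    try:
        doc = json.loads(raw_model_output)
    except json.JSONDecodeError as exc:
        problems = [f"invalid JSON: {exc.msg}"]
    else:
        if isinstance(doc, dict):
            problems = schema_problems(doc)
        else:
            problems = ["top-level value is not an object"]
    if problems:
        logger.warning("schema_validation_failed: %s", problems)
        return "review", problems
    return "continue", []


lane, problems = gate('{"total": "12.50"}')
print(lane)  # review
```

Note that the string "12.50" is reported rather than coerced to a number. Coercion is exactly the silent fix the gate is meant to prevent.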
Confidence scores are mostly lies. Validation rules aren't.
Every OCR vendor will sell you a "confidence score." Treat it as a weak signal, not a decision boundary. A 0.97 confidence on an invoice total means nothing if the total doesn't equal the sum of the line items.
Real confidence comes from cross-field validation. For invoices, the rules you almost always want:
from dataclasses import dataclass

TOLERANCE = 0.01  # allow 1 cent for rounding


@dataclass
class ValidationError:
    code: str
    message: str
    severity: str = "hard"  # "hard" blocks posting; "soft" only forces review


# looks_like_eu_vat() and vendor_exists_in_master() are your own lookups,
# defined elsewhere. subtotal and tax_total must be present for these checks.
def validate_invoice(doc: dict) -> list[ValidationError]:
    errors = []

    # Arithmetic checks. Round the difference so float noise can't exceed the tolerance.
    line_sum = sum(li["line_total"] for li in doc["line_items"])
    if round(abs(line_sum - doc["subtotal"]), 2) > TOLERANCE:
        errors.append(ValidationError("subtotal_mismatch",
            f"Line items sum to {line_sum}, subtotal is {doc['subtotal']}"))
    if round(abs(doc["subtotal"] + doc["tax_total"] - doc["total"]), 2) > TOLERANCE:
        errors.append(ValidationError("total_mismatch",
            "subtotal + tax != total"))

    # Per-line arithmetic
    for i, li in enumerate(doc["line_items"]):
        expected = round(li["quantity"] * li["unit_price"], 2)
        if round(abs(expected - li["line_total"]), 2) > TOLERANCE:
            errors.append(ValidationError("line_arithmetic",
                f"Line {i}: qty * price != line_total"))

    # Date sanity (ISO 8601 date strings compare correctly as text)
    if doc.get("due_date") and doc["due_date"] < doc["issue_date"]:
        errors.append(ValidationError("date_inversion",
            "Due date before issue date"))

    # Currency / country consistency
    if doc["currency"] == "EUR" and doc.get("vendor", {}).get("tax_id"):
        if not looks_like_eu_vat(doc["vendor"]["tax_id"]):
            errors.append(ValidationError("vat_format",
                "EUR invoice with non-EU-looking VAT ID"))

    # External lookups. An unknown vendor goes to review rather than rejection.
    if not vendor_exists_in_master(doc["vendor"]["name"]):
        errors.append(ValidationError("unknown_vendor",
            "Vendor not found in master data", severity="soft"))

    return errors
These checks catch the failures that confidence scores never will. Say the model reads the invoice total "1,234.56" correctly as 1234.56, but reads a line item as 234.56 because the leading "1," was smudged. Confidence: 0.94. Reality: off by a thousand. Arithmetic check: caught it.
A useful framing: confidence determines who reviews it. Validation determines whether it can be posted at all.
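The smudged-comma case is easy to reproduce. The numbers below are invented for illustration, and the sketch uses `Decimal` on string amounts rather than floats, which is one way to keep the one-cent tolerance exact. The function name `subtotal_matches` is hypothetical.

```python
from decimal import Decimal


def subtotal_matches(line_totals: list[str], subtotal: str,
                     tolerance: str = "0.01") -> bool:
    """Compare the sum of line totals to the subtotal within a cent."""
    total = sum((Decimal(x) for x in line_totals), Decimal("0"))
    return abs(total - Decimal(subtotal)) <= Decimal(tolerance)


subtotal = "1334.56"
read_correctly = ["1234.56", "100.00"]
smudged_comma = ["234.56", "100.00"]  # leading "1," lost in extraction

print(subtotal_matches(read_correctly, subtotal))  # True
print(subtotal_matches(smudged_comma, subtotal))   # False
```

The extractor's confidence in the smudged read never enters the check, which is why the check holds up when the score does not.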
Build a three-lane routing system
After extraction and validation, every document falls into one of three lanes:
- Auto-post: Schema valid, all hard rules pass, soft signals strong (known vendor, expected amount range, no anomalies).
- Human review: Schema valid, but something is borderline — new vendor, total outside the typical range, missing optional but useful field.
- Reject / re-extract: Schema invalid, arithmetic broken, document type unclear.
The decision logic looks something like this:
from enum import Enum, auto


class Route(Enum):
    AUTO_POST = auto()
    HUMAN_REVIEW = auto()
    REJECT = auto()


def route(doc, validation_errors, soft_signals):
    # Assumes each validation error carries a severity of "hard" or "soft".
    if validation_errors:
        hard = [e for e in validation_errors if e.severity == "hard"]
        if hard:
            return Route.REJECT, hard
        return Route.HUMAN_REVIEW, validation_errors
    if soft_signals.vendor_is_new:
        return Route.HUMAN_REVIEW, ["new_vendor"]
    if soft_signals.amount_zscore > 3:
        return Route.HUMAN_REVIEW, ["unusual_amount"]
    if soft_signals.duplicate_candidate:
        return Route.HUMAN_REVIEW, ["possible_duplicate"]
    return Route.AUTO_POST, []
The numbers that matter here are your auto-post rate and your false-auto-post rate. You want the first as high as possible and the second at zero. Start conservative — auto-post only the safest 20-30% — and widen the gate as you watch the human-review queue and learn what's actually safe.
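The `amount_zscore` signal in the router has to come from somewhere. A minimal sketch, assuming you can pull a vendor's previous invoice totals from your own records, is below. The function name and the choice to return `None` for thin history are assumptions, not a prescribed method.

```python
from statistics import mean, stdev
from typing import Optional


def amount_zscore(amount: float, history: list[float]) -> Optional[float]:
    """Z-score of an invoice total against the vendor's prior totals.

    Returns None when there are fewer than two prior invoices or when
    all prior totals are identical, because no spread can be estimated.
    """
    if len(history) < 2:
        return None
    sigma = stdev(history)
    if sigma == 0:
        return None
    return (amount - mean(history)) / sigma


previous_totals = [100.0, 110.0, 90.0, 105.0, 95.0]
print(amount_zscore(500.0, previous_totals) > 3)  # True
print(amount_zscore(102.0, previous_totals) > 3)  # False
```

A `None` result should be treated as "not enough history", which most routers would send to human review along with new vendors.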
Make human review fast or it won't happen
The fastest way to kill a document pipeline is to make the human-review UI worse than the original manual process. If your reviewer has to open a PDF in one tab, your extraction JSON in another, and the ERP in a third, you've built nothing.
What a usable review UI shows on one screen:
- The source document, with bounding boxes on the extracted fields
- Each extracted field as an editable input
- The validation errors highlighted in red, with the rule that failed
- A side-by-side diff if this looks like a duplicate of a prior document
- One button: "Approve & post" or "Reject with reason"
Two design rules that pay back enormously:
- Click-to-locate: clicking a field scrolls the PDF to the bounding box. Reviewers verify in 2 seconds instead of 20.
- Capture corrections as training data: when a reviewer changes a field, log the original extraction, the correction, and the document. This dataset is gold for the next pass — for fine-tuning, for prompt examples, or just for finding which document templates are killing your accuracy.
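Capturing a correction can be as simple as appending one JSON line per changed field. This sketch assumes a flat record shape; the field names and the `correction_record` function are illustrative, and where the line gets written (file, queue, table) is left to you.

```python
import json
from datetime import datetime, timezone


def correction_record(doc_id: str, field: str, extracted, corrected,
                      reviewer: str) -> str:
    """Serialize one reviewer correction as a single JSON line."""
    return json.dumps({
        "doc_id": doc_id,          # lets you find the source document later
        "field": field,
        "extracted": extracted,    # what the model produced
        "corrected": corrected,    # what the reviewer entered
        "reviewer": reviewer,
        "corrected_at": datetime.now(timezone.utc).isoformat(),
    }, sort_keys=True)


line = correction_record("doc-001", "total", 234.56, 1234.56, "reviewer-7")
print(json.loads(line)["field"])  # total
```

Storing the document ID rather than the document itself keeps the log small while still letting you rebuild the full example for prompts or fine-tuning.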
A realistic target for a mature pipeline on semi-structured business docs is 60-85% auto-post, with the rest reviewed in under 30 seconds per document. The exact number depends on document diversity and how strict your hard rules are — don't take anyone's marketing claims at face value here.
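To track these numbers on your own data, you need the routing decision for each document and some later signal that an auto-posted document turned out to be wrong. The sketch below assumes both live in a simple list of dicts; the keys `route` and `later_found_wrong` and the function name are hypothetical.

```python
def pipeline_rates(outcomes: list[dict]) -> dict:
    """Compute the auto-post rate and the false-auto-post rate.

    The false-auto-post rate is measured against auto-posted documents
    only, so it does not shrink just because more documents go to review.
    """
    total = len(outcomes)
    auto = [o for o in outcomes if o["route"] == "auto_post"]
    wrong = [o for o in auto if o["later_found_wrong"]]
    return {
        "auto_post_rate": len(auto) / total if total else 0.0,
        "false_auto_post_rate": len(wrong) / len(auto) if auto else 0.0,
    }


sample = (
    [{"route": "auto_post", "later_found_wrong": False}] * 6
    + [{"route": "auto_post", "later_found_wrong": True}]
    + [{"route": "human_review", "later_found_wrong": False}] * 3
)
print(pipeline_rates(sample)["auto_post_rate"])  # 0.7
```

In this invented sample, one of seven auto-posted documents was wrong, which is a reason to tighten the gate rather than widen it.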
Close the loop: posting, idempotency, and audit
The last mile is where pipelines lose data. Some non-negotiables:
Idempotency. Every document gets a stable fingerprint: a hash over the fields that identify the invoice (vendor, invoice number, total). The posting step checks the ERP for that fingerprint before inserting. Reprocessing a batch should never create duplicates.
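import hashlib


def post_to_erp(doc):
    # erp, audit, and PostResult stand in for your own ERP client and audit logger.
    key = f"{doc['vendor']['name']}|{doc['invoice_number']}|{doc['total']}"
    # sha256 needs bytes; hexdigest() gives a string that is easy to store and query.
    fingerprint = hashlib.sha256(key.encode("utf-8")).hexdigest()
    if erp.exists(fingerprint=fingerprint):
        return PostResult.SKIPPED_DUPLICATE
    result = erp.create_invoice(doc, fingerprint=fingerprint)
    audit.log(doc_id=doc["id"], action="posted", erp_id=result.id)
    return result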
Audit trail. Every document needs a record: which extractor version processed it, which schema version, what was extracted, what was corrected by a human, who approved it, when it was posted, what ERP ID came back. When finance asks "why did this hit the wrong account in March?", you need to answer in minutes, not days.
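The fields listed above map directly onto a record type. This is one possible shape, not a required one; the class name, field names, and the sample values are all illustrative.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    doc_id: str
    extractor_version: str      # which extractor version processed it
    schema_version: str         # which schema version it was validated against
    extracted: dict             # what the model produced
    corrections: dict           # field -> value entered by a human, if any
    approved_by: Optional[str]  # None for auto-posted documents
    posted_at: Optional[str]    # ISO 8601 timestamp, None until posted
    erp_id: Optional[str]       # ID returned by the ERP, None until posted


record = AuditRecord(
    doc_id="doc-001",
    extractor_version="extractor-v3",
    schema_version="invoice-v2",
    extracted={"total": 234.56},
    corrections={"total": 1234.56},
    approved_by="reviewer-7",
    posted_at="2025-03-14T09:30:00+00:00",
    erp_id="ERP-000123",
)
print(asdict(record)["corrections"])  # {'total': 1234.56}
```

Keeping `extracted` and `corrections` side by side is what lets you answer the "why did this hit the wrong account" question from one row.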
Dead-letter queue. Documents that fail extraction or are rejected get a parking lot, not silent deletion. Review it weekly. It's where you find the next template you need to support and the next validation rule you forgot.
Feedback to the model. Every human correction is a labeled example. Build a monthly review where you look at the top 10 fields most frequently corrected and decide: is this a prompt fix, a schema fix, a new validation rule, or a fine-tune?
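Finding the most frequently corrected fields is a counting exercise once corrections are logged. The sketch assumes each logged correction is a dict with a `field` key; that shape and the function name are assumptions.

```python
from collections import Counter


def most_corrected_fields(corrections: list[dict], n: int = 10) -> list[tuple[str, int]]:
    """Return the n fields reviewers changed most often, with counts."""
    return Counter(c["field"] for c in corrections).most_common(n)


log = [
    {"field": "total"},
    {"field": "due_date"},
    {"field": "total"},
    {"field": "vendor.tax_id"},
    {"field": "total"},
    {"field": "due_date"},
]
print(most_corrected_fields(log, n=2))  # [('total', 3), ('due_date', 2)]
```

Grouping the same counts by vendor or template is a natural next cut, since a single bad template often accounts for most corrections on one field.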
How BizFlowAI approaches this
Document pipelines — invoices, contracts, delivery notes, customs paperwork — are one of the core things we build and run for clients. The pattern above is roughly what we ship: schema-first extraction, hard validation rules tied to the client's business logic, a three-lane router, and a fast review UI that captures corrections back into the system. We integrate to whatever's on the other end — Odoo, SAP Business One, NetSuite, QuickBooks, a custom internal system — and we own the boring parts: idempotency, audit, dead-letter handling, retries.
A discovery call scopes your specific case end to end: which documents, which volumes, which ERP, which rules, what your current error modes look like, and where the human review needs to sit. The output is a concrete plan with the auto-post target, the review workload estimate, and the integration points — not a generic proposal.
What to build first
If you're starting from scratch on a document pipeline, build in this order. Skipping steps is the most common reason these projects fail.
- Write the schema. Before any model call. Argue with finance about edge cases now, not in production.
- Build the validation rules. Arithmetic, date sanity, vendor lookup, currency consistency. These will outlive any model you choose.
- Pick an extractor and wire it to the schema. Use structured output. Reject anything that doesn't validate.
- Build the review UI. Even a crude one. Don't post anything to the ERP until a human has approved the first 100 documents.
- Add the router. Start with everything going to review. Widen the auto-post gate only when you've watched real data for at least a few weeks.
- Add posting with idempotency and audit. Then, and only then, turn on auto-post for the safest lane.
- Set up the feedback loop. Weekly review of the dead-letter queue. Monthly review of correction patterns.
OCR has been good enough for years. The pipelines that actually save people time are the ones built by someone who treated reading the document as the easy part — and put the engineering effort where it mattered: validation, routing, review, and the boring plumbing that keeps the ERP clean.
Frequently asked questions
What are the stages of a production document processing pipeline?
A real document pipeline has six stages: ingest (email, SFTP, scans), extract (OCR and structure detection), normalize (dates, currencies, vendor names), validate (business rules and cross-field checks), route (auto-post, human review, or reject), and reconcile (posting results and audit trail). Most tutorials only cover extraction and stop. Skipping any stage produces a pipeline that demos well but fails when a vendor changes their invoice template. Each stage has its own failure modes that need explicit handling.
Should I use a vision LLM or a layout model like Textract for invoices?
Use layout-aware models (Textract, Document AI, Azure Document Intelligence) for structured forms with consistent layouts like W-2s or customs forms, since they are cheaper, faster, and return bounding boxes. Use vision-capable LLMs (GPT-4o, Claude, Gemini) with a JSON schema for semi-structured business documents like invoices that vary across hundreds of vendors, because you avoid maintaining per-vendor templates. For unstructured legal documents, use an LLM that returns clause-level citations. Match the extractor to the document shape, not to hype.
Are OCR confidence scores reliable for deciding to auto-post invoices?
No. OCR confidence scores are a weak signal and should not be used as a decision boundary. A 0.97 confidence on an invoice total is meaningless if the total doesn't equal the sum of the line items. Real confidence comes from cross-field validation: arithmetic checks, date sanity, currency-country consistency, and vendor master lookups. Use confidence to decide who reviews a document, but use validation rules to decide whether it can be posted at all.
How should I route documents after extraction in an AP automation pipeline?
Use three lanes. Auto-post documents that pass schema validation, all hard rules, and have strong soft signals (known vendor, expected amount, no duplicates). Send to human review when the schema is valid but something is borderline, such as a new vendor, unusual amount (z-score > 3), or possible duplicate. Reject or re-extract when the schema is invalid or arithmetic is broken. Start conservative by auto-posting only the safest 20-30% and widen the gate as you learn what is actually safe.
Why should I define a JSON schema before calling the extraction model?
Writing the JSON schema first forces you to answer business questions like whether due_date is mandatory, how to store the vendor tax ID, and what to do when a PO number is missing. The schema then serves two purposes: as the structured-output constraint when calling the model, and as a hard gate before data moves downstream. If the model returns something that doesn't validate, log it as a pipeline event and route to review rather than silently fixing it. This prevents bad data from quietly reaching your ERP.
Work with BizFlowAI
If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.
Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.