Liquid AI's LFM2.5-230M: A 230M Model That Extracts

By Lazar Milicevic · Published June 26, 2026 · 9 min read

You're running a document pipeline that hits the OpenAI API thousands of times a day to pull structured fields out of PDFs. The work is boring — names, addresses, line items, invoice totals — but the bill at the end of the month isn't. And every call goes out to a vendor that could rate-limit you, deprecate your model, or change the price next quarter.

That's the exact problem Liquid AI's new 230-million-parameter model is built for. LFM2.5-230M is small enough to run on a laptop, a Raspberry Pi-class device, or directly inside a desktop app, and it's tuned specifically for the kind of structured extraction work that most teams overpay for. Here's what it actually means for builders shipping on-device or cost-sensitive workflows.

What LFM2.5-230M actually is

LFM2.5-230M is a 230-million-parameter foundation model from Liquid AI, designed for on-device agentic workflows — meaning it's built to run locally on phones, laptops, and embedded hardware rather than in a data center. Liquid AI was founded by researchers from MIT CSAIL working on Liquid Neural Networks, a class of dynamical models that aim to do more with fewer parameters than a vanilla transformer.

The headline claim from Liquid's release: at data extraction tasks, LFM2.5-230M holds its own against models roughly 4x its size. That's a specific, narrow claim — not "beats GPT-5 at everything." For the kind of work most automation pipelines actually do, narrow is the right shape.

A quick sanity check on what "230M parameters" means in practice:

Model size	Approx RAM (4-bit)	Typical home
230M	~150–300 MB	Phone, browser, Raspberry Pi-class SBC
1B	~700 MB–1 GB	Laptop, edge device
7B	~4–5 GB	Workstation, decent GPU
70B+	40+ GB	Cloud GPU

A 230M model in 4-bit quantization fits comfortably in the memory budget of a mid-range smartphone. That's the unlock.
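To see where the numbers in the table come from, here is a back-of-the-envelope sketch. It only counts raw weight storage (parameters times bits per weight); the function name is ours, and the rest of each range in the table is overhead this sketch ignores: KV cache, activations, the runtime itself, and the fact that real 4-bit formats typically average somewhat more than 4 bits per weight.

```python
def estimate_weight_memory_mb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage in MB (1 MB = 10**6 bytes), excluding runtime overhead."""
    return n_params * bits_per_weight / 8 / 1e6


# Weights-only estimate for each row of the table at 4 bits per weight
for name, n_params in [("230M", 230e6), ("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    weights_mb = estimate_weight_memory_mb(n_params, bits_per_weight=4)
    print(f"{name}: ~{weights_mb:,.0f} MB of weights at 4-bit")
```

For the 230M row this gives about 115 MB of weights, which is why the practical footprint lands in the 150–300 MB range once overhead is added.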

Why "small + local + extraction" is the right combo

Most production AI workloads are not "write me a creative essay." They are:

Pull 12 fields out of an invoice PDF.
Classify an incoming email into one of 8 buckets.
Normalize a messy address string into JSON.
Extract entities from a customer support transcript.
Decide if a webhook payload should trigger a follow-up.

These tasks have three properties that favor small local models:

The output is structured and constrained. You want JSON conforming to a schema. You don't need a 70B reasoning model to fill in {"invoice_number": "...", "total": ...}.
Volume is high, value-per-call is low. Paying GPT-class prices per extraction destroys the unit economics.
The data is often sensitive. Invoices, contracts, PHI, customer PII. Sending it to a third-party API creates compliance work you'd rather avoid.

A small local model directly attacks all three. The trade-off is that it won't reason its way through ambiguous edge cases the way a frontier model will. For a well-defined extraction task with a clear schema, that's a trade you'll often want to make.

Where it fits in a real pipeline

The mistake teams make with small models is treating them as a drop-in replacement for GPT-5. They're not. The right pattern is a routed pipeline: cheap local model handles the 80%, frontier model handles the hard 20%.

# Simplified routing pattern. lfm_extract, frontier_extract, is_valid, and
# INVOICE_SCHEMA are placeholders for your own implementations.
def extract_invoice(pdf_text: str) -> dict:
    # 1. Try the small local model first
    local_result = lfm_extract(pdf_text, schema=INVOICE_SCHEMA)

    # 2. Validate against schema + confidence heuristics
    #    (a missing "_confidence" key is treated as zero confidence)
    if is_valid(local_result) and local_result.get("_confidence", 0.0) > 0.85:
        return local_result

    # 3. Escalate to the frontier API only when needed
    return frontier_extract(pdf_text, schema=INVOICE_SCHEMA)

In real pipelines we've shipped, this kind of routing knocks 70–90% of calls off the expensive API without measurable accuracy loss — because most invoices, most emails, and most form-like documents are not edge cases. They're boring, and a small model handles boring extremely well.
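The validation step in that routing sketch is where most of the safety comes from, so here is one way it could look. This is a minimal sketch using only the standard library; the field list and the name REQUIRED_FIELDS are illustrative assumptions, not anything Liquid AI ships, and in a real pipeline you would likely swap this for JSON Schema or Pydantic validation.

```python
from datetime import datetime

# Hypothetical field spec: field name -> accepted Python type(s) after json.loads
REQUIRED_FIELDS = {
    "invoice_number": str,
    "vendor": str,
    "issue_date": str,
    "total": (int, float),
    "currency": str,
    "line_items": list,
}


def is_valid(result: object) -> bool:
    """Return True only if every required field is present and well-typed."""
    if not isinstance(result, dict):
        return False
    for field, expected in REQUIRED_FIELDS.items():
        value = result.get(field)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return False
    # The date must follow the YYYY-MM-DD pattern
    try:
        datetime.strptime(result["issue_date"], "%Y-%m-%d")
    except ValueError:
        return False
    # Every line item needs a text description and a numeric amount
    for item in result["line_items"]:
        if not isinstance(item, dict):
            return False
        if not isinstance(item.get("description"), str):
            return False
        amount = item.get("amount")
        if isinstance(amount, bool) or not isinstance(amount, (int, float)):
            return False
    return True
```

Extra keys such as "_confidence" pass through untouched, so the same check works whether or not your local wrapper attaches a confidence score.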

The other place this pattern shines: latency. For short outputs on capable hardware, a local 230M model can return in tens of milliseconds, not seconds. If your UX is "user pastes text → fields auto-populate," local-first feels instant in a way a network round trip to a cloud API struggles to match.

Running LFM2.5-230M locally

Liquid AI publishes its models on Hugging Face and through its LEAP SDK. The simplest path for a backend service is transformers plus a small inference wrapper. For phone or desktop deployment, you'll want LEAP or a runtime like llama.cpp / MLX.

A minimal extraction example with transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
import json

MODEL_ID = "LiquidAI/LFM2.5-230M"  # check the current model card for exact ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

SYSTEM = """You extract structured invoice data. Output only valid JSON
matching this schema:
{
  "invoice_number": string,
  "vendor": string,
  "issue_date": "YYYY-MM-DD",
  "total": number,
  "currency": string,
  "line_items": [{"description": string, "amount": number}]
}"""

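def extract(text: str) -> dict:
    """Run one extraction and parse the model's reply as JSON."""
    messages = [
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": text},
    ]
    # Render the conversation with the model's own chat template
    prompt = tokenizer.apply_chat_template(messages, tokenize=False,
                                           add_generation_prompt=True)
    # The template already contains the special tokens, so don't add them twice
    inputs = tokenizer(prompt, return_tensors="pt",
                       add_special_tokens=False).to(model.device)
    # Greedy decoding for deterministic output. In transformers, temperature
    # only takes effect with do_sample=True, so it is not passed here.
    out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens, not the echoed prompt
    raw = tokenizer.decode(out[0][inputs.input_ids.shape[1]:],
                           skip_special_tokens=True)
    # Raises json.JSONDecodeError if the reply is not valid JSON
    return json.loads(raw)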

Two things to call out:

Use low temperature for extraction: 0.0–0.2, or plain greedy decoding. You want the model to be boring and consistent, not creative.
Don't trust freeform JSON. Use a constrained-decoding library (Outlines, lm-format-enforcer, or the structured-output features in llama.cpp) to force the model's output to conform to your schema. With a 230M model this matters more than with a 70B, because small models are more likely to drift off-schema. Constrained decoding fixes the shape of the output; it does not guarantee that the extracted values are correct, so you still need validation.
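If you aren't using constrained decoding yet, or a reply gets cut off by the token limit, a defensive parser is a useful backstop. The sketch below is not constrained decoding, just a best-effort guard; parse_json_object is a hypothetical helper name, and it assumes the reply should contain exactly one JSON object.

```python
import json
from typing import Optional


def parse_json_object(raw: str) -> Optional[dict]:
    """Best-effort parse of a model reply that should contain one JSON object."""
    # Tolerate chatter or code fences around the object by slicing
    # from the first "{" to the last "}"
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(raw[start:end + 1])
    except json.JSONDecodeError:
        return None
    # Arrays, strings, and numbers are valid JSON but not what we asked for
    return parsed if isinstance(parsed, dict) else None
```

A None result should count as a failed extraction, which in a routed pipeline means escalating rather than retrying blindly.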

For phone or robotics deployment, the workflow is different: convert to a quantized format (Q4_K_M or smaller is fine for extraction), bundle with your runtime, ship. The whole model + runtime can fit in a few hundred MB.

Honest limitations

A 230M model is not magic. Here's where it'll fall over, and what to do about each:

1. Long context, complex reasoning. If you need the model to read a 30-page contract and reason about a clause that references three other clauses, send that to a frontier model. Small models lose the thread on long, structurally complex inputs.

2. Rare formats and weird layouts. Out-of-distribution document layouts (handwritten forms, multi-column scientific PDFs, dense tables with merged cells) will reduce accuracy. Pre-process with a good layout-aware OCR (Unstructured, Marker, or commercial APIs) before handing text to the model.

3. Multilingual edge cases. Check the model card for language coverage before assuming it'll do well on non-English documents. Small models often have narrower language profiles than their bigger siblings.

4. Drift over time. As your input distribution shifts (new vendors, new invoice formats), accuracy degrades silently. You need monitoring — log a sample of outputs and have a human or a frontier model audit them weekly.

5. The benchmark vs. your data gap. "Holds its own against models 4x its size at data extraction" is a benchmark claim. Your data is not the benchmark. Always run a pilot on 500–1,000 of your real documents before committing to a deployment plan.
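For the drift point in particular, the audit sample should be cheap to take and reproducible. One possible approach, sketched here with a hypothetical should_audit helper and an arbitrary 2% default rate, is to hash each document ID so the same document is always in or out of the sample no matter which worker processes it.

```python
import hashlib


def should_audit(document_id: str, sample_rate: float = 0.02) -> bool:
    """Deterministically select roughly `sample_rate` of documents for audit."""
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1)
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < sample_rate
```

Log the input and the model's output for every document where this returns True, then review that log on whatever cadence you've committed to.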

A decision framework: when to pick small + local

Here's the framing we use with clients when they ask "should I run a small local model or just hit the API?"

Signal	Lean local (LFM2.5-230M class)	Lean cloud API
Volume	>10k extractions/day	<1k/day
Latency need	<200ms required	Seconds are fine
Data sensitivity	PII, PHI, contracts, financials	Public or low-sensitivity
Schema	Well-defined, stable	Open-ended, evolving
Edge cases	<20% of inputs	>40% of inputs
Deployment target	Phone, desktop app, on-prem	Pure SaaS backend
Engineering capacity	Have someone who can babysit a model	Want zero infra work

Two or more rows on the left and you should at minimum pilot a small local model. Two or more on the right and the API is probably the right call until volume justifies the switch.
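If you want to make that rule of thumb explicit, it can be written down in a few lines. This is only a sketch: the function names are ours, the volume thresholds come from the table, and checking the local side first when both sides reach two is our reading of "at minimum pilot", not a hard rule.

```python
from typing import Optional


def classify_volume(extractions_per_day: int) -> Optional[str]:
    """Map daily volume onto the table's first row."""
    if extractions_per_day > 10_000:
        return "local"
    if extractions_per_day < 1_000:
        return "cloud"
    return None  # between the two thresholds: no signal either way


def recommend(signals: dict) -> str:
    """Apply the two-or-more rule to a {row name: "local" | "cloud" | None} mapping."""
    local_votes = sum(1 for side in signals.values() if side == "local")
    cloud_votes = sum(1 for side in signals.values() if side == "cloud")
    if local_votes >= 2:
        return "pilot a small local model"
    if cloud_votes >= 2:
        return "stay on the cloud API for now"
    return "no clear lean"


# Example: a high-volume invoice pipeline with a latency target and PII
signals = {
    "volume": classify_volume(25_000),
    "latency": "local",
    "data_sensitivity": "local",
    "engineering_capacity": "cloud",
}
print(recommend(signals))
```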

The other dimension that matters: vendor risk. If your entire product is one API change away from breaking, owning a local model — even as a fallback — is good operational hygiene. We've watched two clients in the last year scramble when their primary model was deprecated. The teams with a local fallback shipped a one-line config change. The others rewrote prompts for a week.

How BizFlowAI approaches this

We've been running this exact pattern — small local model for the bulk, frontier API for the hard cases — in production document pipelines for solopreneurs and small ops teams. The typical setup processes invoices, receipts, or inbound leads, extracts structured data, validates it against a schema, and pushes it into accounting software, a CRM, or a database. A model in the 230M–1B range handles the majority of calls, with a routing layer that escalates only when confidence is low.

A model like LFM2.5-230M is a useful new option in that stack — particularly for clients who want extraction running on a laptop or behind their own firewall for compliance reasons. If you're sitting on a workflow where you're paying per-call for what looks like commodity extraction, that's the kind of thing worth a short discovery call. We'll tell you honestly whether a small local model is the right move or whether you should just keep hitting the API.

What to do this week

If you want to actually evaluate this for your pipeline, not just bookmark it:

Pull 500 real documents from your pipeline. Not synthetic, not the easy ones — a real sample including your weird edge cases.
Define your schema explicitly. Write it as JSON Schema or a Pydantic model. If you can't write the schema, you don't have an extraction problem yet, you have a product definition problem.
Run LFM2.5-230M on the sample with constrained decoding. Measure: schema-valid rate, field-level accuracy, p95 latency.
Run your current solution on the same sample. Same metrics.
Compute cost-per-1000-extractions for each. Include your time, not just API fees.
Decide on a routing threshold. What confidence level triggers escalation to a frontier model? Start strict (escalate often), loosen as you gain data.
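The three pilot metrics from the list are simple enough to compute by hand. Here is a minimal sketch under a few assumptions of ours: predictions are dicts (or None when extraction failed), gold labels are dicts with the same field names, accuracy is exact match per field, and p95 uses the nearest-rank method.

```python
def schema_valid_rate(predictions: list) -> float:
    """Share of documents for which extraction produced a usable dict."""
    return sum(p is not None for p in predictions) / len(predictions)


def field_accuracy(predictions: list, gold: list) -> dict:
    """Per-field exact-match accuracy; a failed extraction is wrong on every field."""
    correct: dict = {}
    total: dict = {}
    for pred, truth in zip(predictions, gold):
        for field, expected in truth.items():
            total[field] = total.get(field, 0) + 1
            if pred is not None and pred.get(field) == expected:
                correct[field] = correct.get(field, 0) + 1
    return {field: correct.get(field, 0) / n for field, n in total.items()}


def p95(latencies_ms: list) -> float:
    """95th percentile by nearest rank: the value at position ceil(0.95 * n)."""
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]
```

Run the same three functions over the outputs of your current solution so the comparison is like for like.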

You'll either confirm the small model is good enough for 70%+ of your traffic — which is usually a meaningful cost and latency win — or you'll learn your data is harder than you thought and stick with what you have. Either outcome is useful.
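To put a number on the cost side, the blended cost of a routed pipeline is the local cost on every call plus the API cost on the share that escalates. The sketch below uses made-up placeholder prices purely for illustration; plug in your own measured figures, including amortized hardware and engineering time.

```python
def cost_per_1000(local_cost: float, api_cost: float, escalation_rate: float) -> float:
    """Blended cost of 1,000 extractions when every call hits the local model first."""
    per_call = local_cost + escalation_rate * api_cost
    return 1000 * per_call


# Placeholder prices, not real quotes: $0.0001 per local call, $0.01 per API call
api_only = cost_per_1000(local_cost=0.0, api_cost=0.01, escalation_rate=1.0)
routed = cost_per_1000(local_cost=0.0001, api_cost=0.01, escalation_rate=0.2)
print(f"API only: ${api_only:.2f} per 1,000, routed: ${routed:.2f} per 1,000")
```

The escalation rate is the lever: every point you shave off it by tuning the routing threshold shows up directly in the blended figure.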

The broader point: model releases like LFM2.5-230M don't grab headlines, because they don't beat GPT-5 at writing essays. They're quietly important because they make a specific class of automation (the boring, high-volume, structured kind that actually runs businesses) cheaper, faster, and more private. That's the part of the AI stack worth paying attention to if you ship things.

Work with BizFlowAI

If you'd rather have this built for you, that's what we do: production AI automation for solo founders and small teams — agents, integrations, and document pipelines that actually ship.

Book a free discovery call — 30 minutes, we map the highest-ROI automation in your workflow. No pitch deck, just engineering.

Frequently asked questions

What is Liquid AI's LFM2.5-230M model?

LFM2.5-230M is a 230-million-parameter foundation model from Liquid AI, the MIT CSAIL spinoff behind Liquid Neural Networks. It's designed for on-device agentic workflows and tuned specifically for structured data extraction tasks like pulling fields from invoices, emails, or forms. Liquid claims it matches models roughly 4x its size on extraction benchmarks. In 4-bit quantization it fits in 150-300 MB of RAM, so it runs on phones, laptops, and embedded hardware.

When should I use a small local LLM instead of the OpenAI API?

Use a small local model when you have high volume (>10k extractions/day), strict latency needs (<200ms), sensitive data like PII or PHI, a stable well-defined schema, and a deployment target like a phone or on-prem server. Stick with the cloud API when volume is low, edge cases dominate, or you want zero infra work. A common middle path is a routed pipeline where a local model handles the easy 80% of cases and escalates the hard 20% to a frontier API. In pipelines we've shipped, this pattern cuts 70-90% of expensive API calls without measurable accuracy loss.

How do I run LFM2.5-230M locally for data extraction?

The simplest path is loading it from Hugging Face with the transformers library using AutoModelForCausalLM and AutoTokenizer, then prompting with a system message that defines your JSON schema. For phone or desktop deployment use Liquid's LEAP SDK or runtimes like llama.cpp or MLX with a Q4_K_M quantized build. Always set temperature to 0.0-0.2 for extraction work and use constrained decoding (Outlines or lm-format-enforcer) to guarantee schema-valid JSON output. Constrained decoding matters more with small models because they drift off-schema more easily.

What are the limitations of a 230M parameter model?

Small models struggle with long-context reasoning across structurally complex documents like multi-clause contracts, and they lose accuracy on out-of-distribution layouts like handwritten forms or dense merged-cell tables. They often have narrower multilingual coverage than larger siblings, and accuracy can drift silently as your input distribution shifts. Benchmark claims rarely match real-world data, so always pilot on 500-1,000 of your actual documents before deploying. Pair the model with layout-aware OCR like Unstructured or Marker, and monitor a sample of outputs weekly.

How much RAM does a 230M parameter model need?

A 230M parameter model in 4-bit quantization needs roughly 150-300 MB of RAM, which fits comfortably on a mid-range smartphone or a Raspberry Pi-class single-board computer. For comparison, a 1B model needs around 700 MB-1 GB, a 7B model needs 4-5 GB, and a 70B+ model needs 40+ GB of GPU memory. The small footprint means you can bundle the model and runtime into a desktop app or mobile binary in a few hundred megabytes. For short outputs on capable hardware, inference latency can be tens of milliseconds rather than seconds.