MRAgent vs LangMem: 27x Fewer Tokens Per Query


You're building an agent that needs to remember things across hundreds of turns — a research assistant, a customer support bot with multi-day cases, a coding agent that tracks a refactor over a week. Two weeks in, your token bill is 40x what you projected, latency is sliding past 15 seconds per turn, and the agent is still confidently citing a fact that was true three sessions ago. The bottleneck isn't the model. It's how you're feeding it memory.

A paper out of the National University of Singapore proposes a framework called MRAgent that reports a roughly 27x token reduction over LangMem on long-horizon reasoning benchmarks — about 118K tokens per query versus 3.26M. That's not a marketing number; that's the difference between a profitable agent and a science project. Here's what the architecture actually does, why retrieval-then-reason is the wrong default, and how to apply the same discipline to whatever you're shipping this quarter.

The real problem: retrieval pipelines return noise, not signal

Most production agents today follow the same pattern: stuff every observation into a vector store, embed the query, pull the top-k chunks, paste them into the prompt, reason. It works for FAQ bots. It collapses on anything that requires reasoning across many steps, because:

  • Top-k retrieval has no concept of relevance over time. A fact retrieved at turn 3 may be stale by turn 80, but it still has a high cosine similarity.
  • Chunks are context-free. A snippet that reads "the deadline is Tuesday" is meaningless without knowing which deadline, which project, which Tuesday.
  • Retrieval volume scales linearly with stored memory. More history means more retrievals, which means more tokens per turn, so cost grows with conversation length. That is exactly the wrong scaling property.

LangMem and similar libraries handle this by being thorough. They store rich memory objects, retrieve generously, and let the LLM sort it out. That's why the per-query token count can balloon into the millions on a long task: the framework is doing the safe thing — give the model everything that might matter and trust it to ignore the rest.

The MRAgent approach inverts this. Instead of "retrieve then reason," the agent reasons about what it needs, fetches only that, and rewrites memory as understanding deepens. It treats memory as a workspace, not a warehouse.

What MRAgent actually changes

The paper's contribution isn't a new vector database or a smarter embedder. It's a control loop. At each step the agent:

  1. Decides whether it needs external memory at all. Most reasoning steps don't. If the current scratchpad is sufficient, skip retrieval entirely.
  2. Constructs a targeted query. Not the raw user message — a specific information need derived from the current reasoning state.
  3. Pulls a small number of memory items and integrates them into a working summary, not a flat context dump.
  4. Writes back consolidated memory when an episode closes, so future retrievals hit denser, higher-signal entries.

The token savings come from steps 1 and 4. Skipping retrieval when you don't need it is the single biggest win — most "memory-augmented" agents in the wild retrieve on every turn, including turns where the model is doing pure arithmetic or formatting. And writing consolidated memory means later retrievals return one well-formed entry instead of seven raw observations.

A simplified version of the control loop:

def agent_step(state, user_input):
    state.append_observation(user_input)

    # Step 1: do we even need external memory?
    working_summary = None  # stays None on turns that skip retrieval
    if needs_memory(state):
        # Step 2: targeted query, not the raw input
        info_need = derive_information_need(state)

        # Step 3: small, focused retrieval
        items = memory.search(info_need, k=3)
        working_summary = integrate(items, state.scratchpad)

    response = reason(state, working_summary)

    # Step 4: consolidate on episode boundaries, including turns
    # that skipped retrieval, so a closing episode is never lost
    if episode_complete(state):
        memory.consolidate(state.episode_buffer)

    return response

That's it. No exotic architecture. The added steps (needs_memory, derive_information_need, consolidate) are themselves LLM calls, but they're small, and they prevent the much larger cost of dragging full memory into every reasoning turn.

Why "retrieve then reason" was always a placeholder

Retrieve-then-reason came from RAG, and RAG was designed for stateless question answering: one question, one answer, no continuity. It got grafted onto agents because the API surface was familiar and the tooling existed. But agents have state, and state changes the math.

A stateless QA system retrieves once per question. An agent on a 100-turn task using the same pattern retrieves 100 times. If each retrieval pulls 5K tokens of context (modest), that's 500K tokens of memory-related I/O before you've counted the actual reasoning. On a long benchmark task, this is exactly how you get to 3M+ tokens per query.
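The arithmetic in the paragraph above can be checked in a few lines. This is a back-of-the-envelope sketch: the function name memory_tokens and the 50% skip rate in the gated variant are illustrative assumptions, not figures from the paper.

```python
def memory_tokens(turns: int, tokens_per_retrieval: int,
                  retrieval_rate: float = 1.0) -> int:
    """Total retrieved-memory tokens over one task.

    retrieval_rate is the fraction of turns that actually retrieve;
    1.0 means retrieving on every turn (the retrieve-then-reason default).
    """
    return round(turns * retrieval_rate * tokens_per_retrieval)


# The example from the text: 100 turns, 5K tokens per retrieval, every turn.
every_turn = memory_tokens(turns=100, tokens_per_retrieval=5_000)
print(every_turn)  # 500000

# Hypothetical gated variant: assume half the turns skip retrieval.
gated = memory_tokens(turns=100, tokens_per_retrieval=5_000, retrieval_rate=0.5)
print(gated)  # 250000

# The ratio behind the headline, from the reported per-query token counts.
reported_ratio = 3_260_000 / 118_000
print(round(reported_ratio, 1))  # 27.6
```

Gating alone only halves the retrieval volume in this toy case. The rest of the reported gap has to come from smaller, denser retrievals, which is what consolidation is for.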

The fix isn't a better retriever. It's recognizing that an agent's memory access pattern should look more like a programmer using grep — targeted, infrequent, with the results immediately consolidated into local variables — and less like a librarian wheeling out a full cart of books every time you ask a question.

Comparing memory architectures honestly

Here's where each approach actually fits. No framework is universally best; the right one depends on conversation length, latency budget, and how often memory matters per turn.

| Approach | Best for | Token cost per turn | Failure mode |
|---|---|---|---|
| In-context only (no external memory) | Short tasks, <20 turns | Low, grows linearly | Context window overflow |
| Vanilla RAG / vector top-k | Stateless QA, doc search | Medium, constant | Stale or irrelevant chunks dominate |
| LangMem-style rich memory | Agents needing high recall, cost-tolerant | High, grows with history | Token blowup on long tasks |
| MRAgent-style dynamic memory | Long-horizon reasoning, cost-sensitive | Low-medium, sub-linear | More complex to debug; consolidation bugs |
| Hand-crafted state machine | Narrow workflows | Very low | Brittle, doesn't generalize |

The honest read: LangMem isn't wrong, it's optimized for a different point on the curve. If you're running a high-value agent where missing a memory recall costs more than $5 in lost work, paying for over-retrieval is rational. If you're running a $0.20-per-task assistant at volume, MRAgent's discipline is the difference between margin and red ink.

Building this yourself: a practical pattern

You don't need to wait for an MRAgent library. The architectural ideas port cleanly into any agent framework. Three patterns to steal:

1. Gate retrieval behind a cheap classifier. Before any retrieval call, run a small prompt that asks "does this turn require external memory? yes/no." Use a cheap model. In our experience this skips retrieval on 40-60% of turns in typical assistant workloads, with minimal recall loss.

GATE_PROMPT = """Given this conversation state and the user's latest message,
does answering correctly require recalling information from prior sessions
or external memory? Answer only YES or NO.

State summary: {summary}
Latest message: {message}"""

def needs_memory(state) -> bool:
    response = cheap_model.complete(
        GATE_PROMPT.format(summary=state.summary, message=state.last_message)
    )
    return response.strip().upper().startswith("YES")

2. Derive an information need, don't query with the raw input. "What's the status of the migration?" is a terrible retrieval query. "Postgres-to-Aurora migration status as of last engineering sync" is a useful one. Let the model rewrite the query against current state before hitting your vector store.
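A minimal sketch of that rewrite step, under stated assumptions: REWRITE_PROMPT is an illustrative prompt, and complete is a hypothetical prompt-in, text-out callable standing in for whatever model client you use. The signature here takes the summary and message directly so the example runs on its own.

```python
from typing import Callable

REWRITE_PROMPT = """Rewrite the user's latest message as a standalone search query
for the agent's memory store. Resolve pronouns and vague references using the
state summary. Output only the query, on one line.

State summary: {summary}
Latest message: {message}"""


def derive_information_need(summary: str, message: str,
                            complete: Callable[[str], str]) -> str:
    """Turn a raw user message into a targeted retrieval query.

    `complete` is a hypothetical model call (an assumption of this sketch).
    Falls back to the raw message if the model returns nothing usable.
    """
    prompt = REWRITE_PROMPT.format(summary=summary, message=message)
    lines = complete(prompt).strip().splitlines()
    query = lines[0].strip() if lines else ""
    return query or message


# Stub model so the sketch runs without an API key.
def fake_model(prompt: str) -> str:
    return "Postgres-to-Aurora migration status as of last engineering sync\n"


query = derive_information_need(
    summary="User is leading a Postgres-to-Aurora migration.",
    message="What's the status of the migration?",
    complete=fake_model,
)
print(query)
```

The fallback to the raw message is a deliberate choice: a weak query still retrieves something, while an empty one silently turns a YES from the gate into no retrieval at all.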

3. Consolidate on episode boundaries, not on every write. Append raw observations to a hot buffer during an episode. When the episode closes — user signs off, task completes, timeout fires — run a single consolidation pass that writes one well-formed memory entry per resolved fact. Future retrievals hit the consolidated entry, not the raw stream.

CONSOLIDATE_PROMPT = """Below is the raw observation buffer from a completed
episode. Extract durable facts (entities, decisions, commitments, state changes)
that may be relevant in future sessions. Output one entry per fact in this format:

FACT: <one sentence>
ENTITY: <primary entity>
EXPIRES: <ISO date or NEVER>

Raw buffer:
{buffer}"""

That EXPIRES field is underrated. Stale memory is the dominant failure mode in long-running agents, and a hard expiration date is the cheapest way to prevent yesterday's truth from becoming tomorrow's hallucination.
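One way to act on the EXPIRES field is to parse the consolidation output into structured entries and filter out expired ones before they reach the prompt. The sketch below assumes the model follows the FACT / ENTITY / EXPIRES format exactly. MemoryEntry, parse_entries, and live_entries are hypothetical names, and the sample text is made up for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class MemoryEntry:
    fact: str
    entity: str
    expires: Optional[date]  # None means the entry never expires


def parse_entries(raw: str) -> List[MemoryEntry]:
    """Parse FACT / ENTITY / EXPIRES blocks, skipping malformed ones."""
    entries: List[MemoryEntry] = []
    current: dict = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        key, value = key.strip().upper(), value.strip()
        if not sep or key not in {"FACT", "ENTITY", "EXPIRES"}:
            continue
        current[key] = value
        if key != "EXPIRES":
            continue
        # EXPIRES is the last field of an entry in the prompt's format.
        if "FACT" in current and "ENTITY" in current:
            try:
                expires = None if value.upper() == "NEVER" else date.fromisoformat(value)
                entries.append(MemoryEntry(current["FACT"], current["ENTITY"], expires))
            except ValueError:
                pass  # unparseable date: drop the entry rather than guess
        current = {}
    return entries


def live_entries(entries: List[MemoryEntry], today: date) -> List[MemoryEntry]:
    """Keep entries that have not expired as of `today`."""
    return [e for e in entries if e.expires is None or e.expires >= today]


# Made-up model output, for illustration only.
sample = """FACT: Aurora cutover is scheduled for the second week of March.
ENTITY: Postgres-to-Aurora migration
EXPIRES: 2025-03-15

FACT: The team prefers Terraform for infrastructure changes.
ENTITY: Platform team
EXPIRES: NEVER"""

entries = parse_entries(sample)
fresh = live_entries(entries, today=date(2025, 4, 1))
print([e.entity for e in fresh])  # ['Platform team']
```

Dropping entries with unparseable dates is the conservative choice here. Keeping them with no expiry would quietly reintroduce the stale-memory problem the field exists to prevent.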

Where this breaks (and how to know)

MRAgent-style architectures aren't free wins. Three failure modes to watch:

Consolidation drift. Every consolidation pass is an LLM call, and LLMs paraphrase. After 20 consolidations of the same fact, the entry can drift far from ground truth. Mitigation: store the original observation alongside the consolidated entry, and re-derive consolidation from originals periodically rather than consolidating consolidations.
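The drift mitigation can be sketched as a record that keeps its raw observations and always rebuilds the summary from them. ConsolidatedFact is a hypothetical type, and consolidate is a stand-in callable for the LLM consolidation call. Rebuilding on every write is a simplification; the text suggests doing it periodically.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ConsolidatedFact:
    summary: str = ""                                    # current consolidated wording
    originals: List[str] = field(default_factory=list)   # raw observations, append-only

    def add_observation(self, observation: str,
                        consolidate: Callable[[List[str]], str]) -> None:
        """Record a raw observation, then rebuild the summary from all
        originals. The previous summary is never fed back in, so
        paraphrasing errors cannot compound across passes."""
        self.originals.append(observation)
        self.summary = consolidate(self.originals)


# Stand-in for an LLM call: joins the observations verbatim.
def join_all(observations: List[str]) -> str:
    return " | ".join(observations)


fact = ConsolidatedFact()
fact.add_observation("Deadline moved to Friday.", join_all)
fact.add_observation("Deadline confirmed for Friday 5pm.", join_all)
print(fact.summary)
```

The cost is that each rebuild reads every original, so in practice you would cap the buffer or rebuild on a schedule rather than on every write.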

Gate false-negatives. If your needs_memory classifier says NO when it should have said YES, the agent confidently invents an answer. Mitigation: log every NO decision, sample 1-2% for human review during the first month, and add specific NO-then-correct examples back into the gate prompt.
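A sketch of the logging and sampling side of that mitigation, assuming gate decisions are stored as JSON lines. The function names and record fields are illustrative, and the in-memory list stands in for whatever log sink you already have.

```python
import json
import random
from typing import List, Optional


def log_gate_decision(log: List[str], turn_id: str, message: str,
                      decision: bool) -> None:
    """Append one gate decision as a JSON line."""
    record = {"turn_id": turn_id, "message": message, "needs_memory": decision}
    log.append(json.dumps(record))


def sample_no_decisions(log: List[str], rate: float = 0.02,
                        seed: Optional[int] = None) -> List[dict]:
    """Select roughly `rate` of the NO decisions for human review."""
    rng = random.Random(seed)
    records = [json.loads(line) for line in log]
    negatives = [r for r in records if not r["needs_memory"]]
    return [r for r in negatives if rng.random() < rate]


gate_log: List[str] = []
log_gate_decision(gate_log, "t1", "Format that as a table.", False)
log_gate_decision(gate_log, "t2", "What did we decide last week?", True)
log_gate_decision(gate_log, "t3", "Thanks!", False)

# rate=1.0 returns every NO decision; use 0.01-0.02 in practice.
for record in sample_no_decisions(gate_log, rate=1.0):
    print(record["turn_id"], record["message"])
```

Sampling is independent per record, so the review queue size varies from day to day. If reviewers need a fixed quota, random.sample over the NO decisions is the alternative.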

Latency variance. Dynamic memory means some turns are fast (no retrieval) and some are slow (retrieval + integration + reasoning). If your UX expects consistent response time, the variance can feel worse than a slower-but-predictable baseline. Mitigation: stream partial responses, and pre-warm consolidation in the background between user turns.

The 27x token reduction is real for the workloads in the paper. For your workload, expect somewhere between 3x and 15x once you account for your specific gate accuracy and consolidation overhead. That's still a margin-changing improvement, but don't budget for the paper's headline number.

A measurement protocol before you switch frameworks

Before you rewrite anything, instrument the agent you have. Most teams have no idea what their memory architecture actually costs per turn. Track these five numbers for a week:

  1. Mean tokens per turn, split into: system prompt, retrieved memory, scratchpad, model output.
  2. Retrieval calls per turn (target: <1 on average for assistant workloads).
  3. Retrieved-but-unused ratio: what fraction of retrieved chunks does the model never cite in its response? In most deployments fewer than 20% of retrieved chunks are actually cited, which puts this ratio above 80% and is your first signal that you're over-retrieving.
  4. Memory recall error rate: how often does the agent fail to recall something it should have? Sample manually; there's no automated metric worth trusting here.
  5. P95 latency per turn, separate from mean.

If retrieved-but-unused is above 50% and mean tokens per turn is dominated by retrieved memory, you're a candidate for MRAgent-style changes. If retrieval calls per turn is already below 0.5 and recall errors are rare, you're fine — don't fix what isn't broken.
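Four of the five numbers can be computed directly from per-turn logs. The sketch below assumes you already record the fields in TurnLog (a hypothetical schema) and uses a nearest-rank P95. Recall error rate is left out because it needs manual sampling.

```python
import math
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class TurnLog:
    system_tokens: int
    retrieved_tokens: int
    scratchpad_tokens: int
    output_tokens: int
    retrieval_calls: int
    chunks_retrieved: int
    chunks_cited: int
    latency_s: float


def summarize(turns: List[TurnLog]) -> Dict[str, float]:
    """Aggregate per-turn logs into the metrics listed above."""
    if not turns:
        raise ValueError("need at least one turn")
    total_tokens = sum(
        t.system_tokens + t.retrieved_tokens + t.scratchpad_tokens + t.output_tokens
        for t in turns
    )
    retrieved = sum(t.chunks_retrieved for t in turns)
    cited = sum(t.chunks_cited for t in turns)
    latencies = sorted(t.latency_s for t in turns)
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)  # nearest-rank
    return {
        "mean_tokens_per_turn": total_tokens / len(turns),
        "retrieved_share_of_tokens": sum(t.retrieved_tokens for t in turns) / total_tokens,
        "retrieval_calls_per_turn": mean(t.retrieval_calls for t in turns),
        "retrieved_but_unused_ratio": (retrieved - cited) / retrieved if retrieved else 0.0,
        "p95_latency_s": latencies[p95_index],
    }


# Two made-up turns: one that retrieves five chunks and cites one, one that skips.
turns = [
    TurnLog(500, 4000, 300, 200, retrieval_calls=1, chunks_retrieved=5,
            chunks_cited=1, latency_s=2.0),
    TurnLog(500, 0, 300, 200, retrieval_calls=0, chunks_retrieved=0,
            chunks_cited=0, latency_s=1.0),
]
stats = summarize(turns)
print(stats["retrieved_but_unused_ratio"])  # 0.8
print(stats["retrieval_calls_per_turn"])    # 0.5
```

In this toy sample the unused ratio is 0.8 and retrieved memory is about two thirds of all tokens, which is the profile the paragraph above flags as a candidate for gating and consolidation.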

How BizFlowAI approaches this

We've shipped enough production agents to know that memory architecture is where the unit economics get decided. The agents that ship and survive contact with real users aren't the ones with the cleverest planning loops — they're the ones with disciplined context engineering: gated retrieval, episode-based consolidation, hard expirations on stored facts, and instrumented token budgets per turn.

When we audit an existing agent for a client, the first thing we measure is the retrieved-but-unused ratio and the per-turn token breakdown. In most cases there's a 5-10x cost reduction available before touching the model or the prompt, just by fixing how memory flows. If you're running an agent in production and the token bill is climbing faster than usage, that's the conversation worth having.

What to take away

The MRAgent paper isn't important because of the specific framework. It's important because it names a pattern most production teams have been working around without articulating: retrieve-then-reason is a stateless idea grafted onto stateful systems, and the cost shows up in your bill before it shows up in your benchmarks.

The fixes are not exotic. Gate retrieval behind a cheap classifier. Derive targeted queries from current state. Consolidate on episode boundaries with explicit expirations. Measure your retrieved-but-unused ratio before you change anything. None of this requires a new library — it requires treating memory as an engineering problem with budgets and assertions, not as a library call you trust to do the right thing.

The teams that get this right will run agents at margins that make the "AI is too expensive" teams look like they're solving a different problem. Because in a real sense, they are.


Frequently asked questions

What is MRAgent and how does it differ from LangMem?

MRAgent is a memory framework from the National University of Singapore that uses a dynamic control loop to decide when an agent needs external memory, instead of retrieving on every turn. Unlike LangMem, which stores rich memory objects and retrieves generously, MRAgent gates retrieval, derives targeted queries, and consolidates memory at episode boundaries. The paper reports roughly 27x fewer tokens per query than LangMem on long-horizon benchmarks (about 118K vs 3.26M tokens). The tradeoff is more complex debugging and potential consolidation drift.

Why does retrieve-then-reason fail for long-running AI agents?

Retrieve-then-reason was designed for stateless RAG question answering, where one query returns one answer. When applied to a 100-turn agent, the pattern retrieves 100 times, often pulling thousands of tokens of context per call, which scales linearly with conversation length. Top-k retrieval also has no concept of time, so stale facts with high cosine similarity keep getting injected. The result is token blowup and the agent confidently citing outdated information.

How can I reduce token costs in a memory-augmented AI agent?

Gate retrieval behind a cheap yes/no classifier before every memory call; in our experience this skips retrieval on 40-60% of turns in assistant workloads. Rewrite the user message into a specific information need before querying the vector store, rather than embedding the raw input. Consolidate raw observations into single well-formed memory entries when an episode ends, so future retrievals hit dense entries instead of fragmented chunks. Add an EXPIRES field to memory entries to prevent stale facts from being recalled.

When should I use LangMem instead of an MRAgent-style architecture?

LangMem is the rational choice when missing a memory recall costs more than the price of over-retrieval, such as high-value research or enterprise agents where errors are expensive. Its thorough retrieval strategy maximizes recall at the cost of tokens. MRAgent-style designs are better for high-volume, cost-sensitive workloads where per-task margins are thin. Neither is universally best; the right choice depends on conversation length, latency budget, and the cost of a missed recall.

What is the biggest source of token savings in MRAgent?

The largest savings come from skipping retrieval on turns that don't need external memory, since most reasoning steps (arithmetic, formatting, simple follow-ups) don't require it. The second source is consolidating raw observations into single dense memory entries at episode boundaries, so later retrievals return one high-signal item instead of many raw chunks. Together, these reduce both retrieval frequency and per-retrieval token volume. The control-loop overhead is small because gating and consolidation use cheap LLM calls.