This post is about persistent memory across sessions. For the related problem of large context windows within a session – and what happens when they fill up – see AI and the 1M token context window.
What’s New This Week
AMA-Bench (arxiv:2602.22769) dropped earlier this month – a new benchmark specifically evaluating long-horizon memory for agentic applications. It distinguishes between memory processing (transforming trajectories into structured facts) and memory retrieval (getting those facts back when needed). Most existing benchmarks only test retrieval. That distinction matters enormously for production systems.
Changelog
| Date | Summary |
|---|---|
| 16 Mar 2026 | First published. |
AI systems have context. They don’t have memory. Most engineers use the terms interchangeably and then spend months debugging why their agent keeps asking users the same questions.
The distinction is precise and it matters. Context is what the model sees right now – the tokens in the current request. Memory is what persists when the session ends. For most deployed systems today, memory is nothing. The session closes, the context clears, and the model wakes up tomorrow knowing exactly as much as it did the first time it met this user.
That works fine for stateless tasks – writing a function, answering a factual question. It fails immediately for anything relational. Customer support, personal assistants, coding agents that build on weeks of architectural decisions, healthcare systems that need to know what a patient reported six months ago. All of these require an agent that remembers. None of them get one by default.
Context is Not Memory
A context window is working memory for a single session. Everything the model can reason about right now has to be in that window. Modern frontier models have pushed this to genuinely useful sizes – a million tokens covers a lot of session history – but the window still closes. Next session, it starts empty.
RAG – retrieval-augmented generation – is the obvious answer and it helps, but it is not memory either. RAG is a query-dependent lookup. You ask a question, the system retrieves semantically similar content, the relevant context arrives in the window. The problem is the query dependency: the system only retrieves what you know to ask for. If a user mentioned three months ago that they were switching from AWS to GCP, that fact will not surface unless the current query happens to trigger it. Most relevant context is not triggered by explicit queries – it is context the system should have incorporated proactively.
Persistent memory is different. It is the accumulation of what happened – events, decisions, preferences, corrections – indexed with enough structure that it can be retrieved not just on explicit query but on relevance. It has temporal awareness. It knows that a fact established two years ago might be stale while a fact from last week is probably still live. RAG has none of that. It stores embeddings and returns cosine similarity. That is a tool. Memory is the architecture built around tools like that.
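To make the contrast concrete, here is a minimal sketch of what a memory record carries beyond an embedding. The field names and the 90-day default are illustrative choices, not a standard schema; the point is that each record knows what kind of fact it is and when it was last confirmed.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

@dataclass
class MemoryRecord:
    """One stored memory: content plus the structure a bare embedding store lacks."""
    content: str                          # the fact or event itself
    kind: str                             # "preference" (stable) or "event" (episodic)
    created_at: datetime                  # when the fact was established
    last_confirmed_at: datetime           # when it was last reinforced
    superseded_by: Optional[str] = None   # id of a newer record replacing this one

    def is_stale(self, now: datetime, ttl: timedelta = timedelta(days=90)) -> bool:
        """Events go stale if never reconfirmed; stable preferences do not."""
        if self.kind == "preference":
            return False
        return now - self.last_confirmed_at > ttl
```

Run against the AWS-to-GCP example above: an event recorded four months ago and never reconfirmed reads as stale, while a preference from the same date does not. A cosine-similarity lookup can make neither distinction.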
Why Persistent Memory Is Hard
Four distinct problems make this difficult, and most systems underestimate all four.
Forgetting between sessions. There is no native mechanism in most LLM architectures for cross-session persistence. Everything has to be engineered. Which events are worth storing? How are they stored? When does the system decide to write a memory versus discard the interaction as routine? This requires active memory management – the agent deciding what matters – rather than passive logging.
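The write-versus-discard decision can be sketched as a gate over each turn. This keyword version is deliberately crude and the marker list is invented for illustration; a production system would ask the model itself to score salience rather than pattern-match.

```python
# Markers that suggest a turn states a preference, decision, or correction.
# Illustrative only - a real system would score salience with an LLM call.
SALIENT_MARKERS = ("prefer", "always", "never", "decided", "switching", "instead")

def should_write_memory(turn: str) -> bool:
    """Crude write gate: store turns that assert something durable,
    discard routine chatter."""
    text = turn.lower()
    return any(marker in text for marker in SALIENT_MARKERS)
```

Even this toy version makes the architectural point: memory writing is a decision the agent takes per interaction, not a side effect of logging everything.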
Temporal reasoning. “I told you this three weeks ago” requires two things: a time-stamped record of what was said, and inference capability that understands the temporal relationship between that record and the current conversation. Many systems store timestamps but few use them in reasoning. The model needs to know that three weeks old is recent, that two years old may be stale, that a user correcting their own prior statement means the new version supersedes the old one.
Relevance decay. Some facts are evergreen – a user’s name, their organisation, their core preferences. Some facts expire – the project they were working on last quarter, the ticket they raised that has since been resolved. AI systems are poor at distinguishing these. They either store everything forever (expensive, noisy) or summarise aggressively (lossy, misses detail). The right model is closer to what humans do: stable facts reinforced and kept current, episodic memories that decay unless referenced again.
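One workable middle ground between "store everything forever" and "summarise aggressively" is a decay score that resets whenever a memory is referenced again, mirroring the reinforcement behaviour described above. The half-life value here is an arbitrary assumption for illustration.

```python
from datetime import datetime, timedelta

def relevance(last_referenced_at: datetime, now: datetime,
              half_life_days: float = 30.0, evergreen: bool = False) -> float:
    """Score in (0, 1]: evergreen facts stay at 1.0; episodic memories halve
    in relevance every half_life_days since they were last referenced."""
    if evergreen:
        return 1.0
    age_days = (now - last_referenced_at).total_seconds() / 86400.0
    return 0.5 ** (age_days / half_life_days)
```

Referencing a memory updates `last_referenced_at`, which resets its score to 1.0 – the reinforcement half of the human-like model. Facts that fall below a threshold become candidates for archival or summarisation rather than deletion.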
Contradiction handling. Users change their minds. They give the system information that conflicts with what they said before. A system that appends facts without reconciling contradictions will accumulate incoherent memory that the model cannot reason about reliably. Managing contradictions requires not just storage but active inference: this new fact updates the prior one; these two facts are in tension and I should flag that.
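The supersede-rather-than-append behaviour can be sketched as a small keyed store. The structure is hypothetical, but it captures the two requirements: the new fact wins, and the old fact is kept with a supersession timestamp for provenance rather than silently overwritten.

```python
from datetime import datetime

class FactStore:
    """Keyed facts with supersession: a new statement replaces a conflicting
    old one, and the old version is retained so the conflict can be flagged."""

    def __init__(self):
        self.current = {}    # key -> {"value", "established_at"}
        self.history = []    # superseded facts, in order

    def assert_fact(self, key: str, value: str, now: datetime) -> None:
        prior = self.current.get(key)
        if prior is not None and prior["value"] != value:
            # Conflict: archive the old version with the time it was superseded.
            self.history.append({**prior, "key": key, "superseded_at": now})
        self.current[key] = {"value": value, "established_at": now}
```

A user who says "we use AWS" in January and "we have moved to GCP" in March ends up with one current fact and one archived one, instead of two contradictory facts competing at retrieval time.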
Current Approaches and Where They Break
Large context windows are the most common response to memory limitations – just make the window big enough to hold everything. This works for single-session depth but does nothing for cross-session persistence. And at 1M tokens, it is expensive. The cost of sending an entire interaction history on every request does not scale to production systems with millions of users.
Structured memory summaries – OpenAI’s memory feature, Claude’s memory – are genuinely useful for stable facts: preferences, names, standing instructions. They are poor for episodic detail. “The user prefers direct communication” survives summarisation. “The user raised a complaint about order #48271 on 12 February and was told it would be resolved by end of month, which it was not” does not. The first is a preference; the second is an event. Most deployed memory systems are built for preferences and struggle with events.
MemGPT and Letta take a different approach: treating context management as the central problem and giving the agent tools to manage its own memory. The MemGPT architecture (now underlying the Letta framework) treats the context window like OS RAM and external storage like disk – agents actively move information between in-context core memory and externally stored archival and recall memory. Sleep-time agents can reorganise and refine memory asynchronously, improving memory quality without adding latency to the live request.
Letta’s benchmarking work revealed something instructive: a Letta agent using simple filesystem tools – grep, semantic search, open/close – outperformed specialised memory tools like knowledge graphs on standard retrieval benchmarks. 74% on LoCoMo with GPT-4o mini, against 68.5% for Mem0’s top-performing graph variant. The lesson is that agent capability matters more than retrieval sophistication. A model that knows how to search iteratively, reformulating queries until it finds what it needs, outperforms a more complex system the model uses poorly.
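The search-reformulate-search loop behind that result can be sketched in a few lines. This is not Letta's implementation – just a minimal illustration of iterative retrieval over a plain-text store, where `reformulate` stands in for an LLM call that proposes new search terms when the first attempt misses.

```python
import re

def iterative_search(terms, documents, reformulate, max_rounds=3):
    """Grep-style search over plain text; on a miss, let the agent
    rewrite its query terms and try again."""
    for _ in range(max_rounds):
        hits = [doc for doc in documents
                if any(re.search(re.escape(t), doc, re.IGNORECASE) for t in terms)]
        if hits:
            return hits
        terms = reformulate(terms)   # e.g. synonyms, broader phrasing
    return []
```

The sophistication lives in the loop, not the index: a single-shot semantic lookup gets one chance to phrase the query well, while an iterative agent gets several.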
AMA-Bench makes a further distinction that current benchmarks mostly miss: the difference between processing trajectories into structured memory and retrieving from that memory later. Most benchmarks test only retrieval, which means they measure the back half of the problem and ignore the front.
Where This Matters in Production
Customer support. The failure mode here is obvious and users hate it. Calling support, explaining context, being transferred, explaining it again. An agent with persistent memory knows the complaint history, the prior resolutions, the unresolved tickets. This requires episodic memory with temporal context – not just “this user has had issues” but “this user raised this specific issue on this date and was told X, which did not happen.”
Coding agents. Architectural decisions made weeks ago – why a module is structured a particular way, which approaches were tried and rejected, what constraints the system is operating under – are exactly the kind of episodic context that disappears between sessions. A coding agent without this memory will propose solutions that were already tried and rejected, or make changes that conflict with design decisions it no longer knows about. This is why AGENTS.md exists as a practice – writing down what the agent needs to know because the agent will not remember it otherwise.
Personal assistants. Preference memory is table stakes. The harder and more valuable problem is project continuity – knowing where a task was left, what was decided, what is still open. This is the agent as a second brain, not just a preference engine.
Healthcare. Patient history over years is the canonical long-horizon memory problem. Here the stakes are highest and the requirements most stringent – temporal accuracy, contradiction flagging, provenance of information. This is where current approaches fall shortest.
Event and entertainment recommendations. Taste built from years of event attendance, purchases, and expressed preferences is a genuine long-term memory problem. The recommendation systems that power venues and ticketing understand this – user taste is longitudinal, not just a snapshot of recent behaviour. AI agents that assist with discovery need the same longitudinal view.
What to Build Today
The honest answer is that no single approach solves this. The architecture that works is a hybrid:
Structured core memory for stable facts. Names, preferences, standing context. Keep these in-context and editable. This is the Letta memory blocks model – small, pinned, actively maintained.
RAG for episodic retrieval. Time-stamp every stored memory. Treat retrieval as iterative, not single-shot – let the agent search, reformulate, and search again. The Letta filesystem result shows that a capable agent using simple tools outperforms a less capable agent using sophisticated ones.
Temporal metadata. When was this fact established? When was it last confirmed? Has it been contradicted? These fields are not optional; they are what makes long-horizon memory coherent rather than just large.
Explicit memory review cycles. The AGENTS.md pattern captures this: important context written down deliberately rather than assumed to persist. Agents that rely on implicit memory accumulate noise. Agents that actively maintain their own memory state – writing, updating, reconciling – are more reliable over time. Letta’s sleep-time agents formalise this: dedicated memory maintenance that runs asynchronously, improving memory quality without blocking live requests.
Letta or MemGPT for agents that need genuine continuity. If you are building a system where cross-session memory is load-bearing – not a nice-to-have – the Letta framework gives you a production-grade implementation of the MemGPT architecture. It handles eviction, summarisation, archival storage, and agent-managed memory blocks. The alternative is building all of this yourself.
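Putting the pieces above together, a minimal sketch of the hybrid layout: a small set of pinned core facts that ride in every prompt, plus a timestamped episodic log that is searched on demand. The class and its limits are illustrative assumptions, loosely modelled on the memory-blocks idea, not any framework's actual API.

```python
from datetime import datetime

class HybridMemory:
    """Hybrid layout: pinned core blocks always in context,
    timestamped episodes retrieved selectively."""

    def __init__(self, core_limit: int = 5):
        self.core = {}        # stable facts, prepended to every request
        self.episodes = []    # (timestamp, text) pairs, searched on demand
        self.core_limit = core_limit

    def pin(self, key: str, value: str) -> None:
        """Core memory is scarce by design - eviction is an explicit decision."""
        if key not in self.core and len(self.core) >= self.core_limit:
            raise ValueError("core memory full; evict or summarise first")
        self.core[key] = value

    def log(self, text: str, when: datetime) -> None:
        """Episodic events are cheap to store but must carry a timestamp."""
        self.episodes.append((when, text))

    def context_header(self) -> str:
        """What every request sees regardless of the query: core facts only."""
        return "\n".join(f"{k}: {v}" for k, v in sorted(self.core.items()))
```

The hard cap on core memory is the design choice that matters: it forces the agent (or a sleep-time maintenance pass) to decide what is truly stable, instead of letting the always-in-context set grow until it becomes noise.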
The Gap Remains
The 1M context window made it much easier to maintain continuity within a session. That is genuinely useful. But it does not solve the problem of continuity across sessions, and the engineering required to build that continuity is significant, underestimated, and often omitted entirely from early production deployments.
The agents that will matter in three years are not faster or smarter – they are the ones that know you. That requires memory in the real sense: persistent, temporal, coherent, and actively maintained. We have the tools to build it. We do not yet have it by default.