Context Mode: Solving the Agent Context Wall
Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New
Quieter day – nothing today that materially shifts the thesis.
Changelog
| Date | Summary |
|---|---|
| 6 Mar 2026 | Initial publication. |
Thirty minutes into a Claude Code session, something changes. Responses slow. The agent starts forgetting what it was doing. You re-explain context you already gave. The session that was running well has hit a wall.
This is not a model quality problem. It is an architecture problem. And someone built a fix.
Context Mode is an MCP server by Mert Köseoğlu – who runs the MCP Directory and Hub, processing 100,000+ daily requests. It has 2,500 GitHub stars. The core idea: every tool output that enters your context window should be compressed before it gets there, and every meaningful event in your session should be tracked so the agent can recover when compaction hits.
The numbers are striking enough to pay attention to: 315 KB of raw tool output compressed to 5.4 KB. Sessions that used to degrade at 30 minutes now run for 3 hours.
The Structural Problem
The context window has two sides. Tool definitions come in. Raw output comes out. Both eat tokens.
Cloudflare’s Code Mode addressed the input side – compressing tool definitions by 99.9% when you have 81+ tools active. Before Context Mode, nobody had seriously addressed the output side.
This matters because the output side is where sessions actually die. Here is what a normal Claude Code session looks like at the token level:
- Playwright snapshot: 56 KB
- 20 GitHub issues: 59 KB
- Access log (500 requests): 45 KB
- Analytics CSV (500 rows): 85 KB
Every tool call that dumps raw data into context spends tokens you cannot reclaim. After 30 minutes of real work – running tests, checking issues, reading logs – you have consumed 40% of a 200K context window. Not on code. Not on reasoning. On raw data that could have been summarised.
The model cannot fix this alone. It does not control what tool outputs look like. It cannot compress a Playwright snapshot before it enters context. It can only work with what arrives, and what arrives is increasingly expensive.
The architecture has to change. Specifically: something needs to sit between tool outputs and the context window, process that data, and return only what matters.
How Context Mode Works: The Sandbox
The first half of Context Mode is context saving. An MCP server intercepts tool calls and routes them through an isolated subprocess instead of dumping raw output directly into context.
Each ctx_execute call spawns a sandboxed subprocess. You write code – JavaScript, TypeScript, Python, Shell, Ruby, Go, Rust, PHP, Perl, or R – that processes the raw data and emits only the meaningful result to stdout. That stdout is what enters the context window. The 56 KB Playwright snapshot, the 45 KB access log, the 85 KB CSV – they never reach context. Only the processed summary does.
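The processing script itself is ordinary code. As a sketch of the shape – the log format, field names, and summary structure here are hypothetical, not Context Mode's actual output – this is the kind of reducer you might run inside the sandbox: a multi-kilobyte access log goes in, a few hundred bytes of summary come out on stdout:

```python
# Sketch of a sandbox reducer: parse raw access-log lines (hypothetical
# 'METHOD PATH STATUS' format) and print only the compact summary.
import re
from collections import Counter

def summarise_access_log(lines):
    """Reduce raw log lines to status counts and the worst 5xx path."""
    statuses = Counter()
    error_paths = Counter()
    for line in lines:
        m = re.match(r"(\w+) (\S+) (\d{3})", line)
        if not m:
            continue
        method, path, status = m.groups()
        statuses[status] += 1
        if status.startswith("5"):
            error_paths[path] += 1
    top = error_paths.most_common(1)
    return {
        "requests": sum(statuses.values()),
        "by_status": dict(statuses),
        "top_5xx_path": top[0][0] if top else None,
    }

raw = [
    "GET /api/users 200",
    "GET /api/orders 500",
    "POST /api/orders 500",
    "GET /api/orders 500",
]
# Only this print reaches the context window; the raw log never does.
print(summarise_access_log(raw))
```

The same pattern applies to every row in the benchmark table below: the raw bytes stay in the subprocess, and stdout carries the meaning.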
The benchmarks, validated across 21 real scenarios:
| Input | Raw | Compressed | Saving |
|---|---|---|---|
| Playwright snapshot | 56 KB | 299 B | 99% |
| GitHub issues (20) | 59 KB | 1.1 KB | 98% |
| Access log (500 requests) | 45 KB | 155 B | 100% |
| Analytics CSV (500 rows) | 85 KB | 222 B | 100% |
| Git log (153 commits) | 11.6 KB | 107 B | 99% |
Full session: 315 KB becomes 5.4 KB. 98% reduction.
This is the right abstraction. It is the same principle as good observability – you do not log the entire HTTP response body, you log what the response means. The raw data is processed at the boundary; the result is what flows downstream. Context Mode applies this to agent tool calls.
Authenticated CLIs work as you would expect: gh, aws, gcloud, kubectl, docker all pass through via credential inheritance. The subprocess picks up environment variables and config paths without exposing them to the conversation. This matters because most real-world coding sessions involve authenticated calls, and a solution that breaks that would not survive contact with actual use.
There is also ctx_batch_execute, which runs multiple commands and searches in a single call – 986 KB compressed to 62 KB in the repo research benchmark. Subagents calling tools in parallel is where raw context burn gets worst. Batching reduces both the token cost and the round-trip count.
Session Continuity: The Problem Nobody Talks About
Context burn is the visible problem. Most engineers frame it as token exhaustion – you run out of window and things slow down. But the deeper problem is coherence loss.
When Claude Code compacts the conversation to free space, it does not just trim tokens. It forgets. Which files were being edited. Which tasks are in progress. Which errors were resolved and which are still open. What you last asked for. The session continues, technically – but you are now working with an agent that has lost its working memory.
This is worse than starting fresh, because the agent does not know it has forgotten. It will proceed confidently with incomplete state, repeat work you already did, or ask you questions you already answered. Coherence loss is the mechanism behind that frustrating experience of re-explaining everything twenty minutes into what should be a continuation.
Context Mode’s second half is a session knowledge base. Every meaningful event during your session is tracked in SQLite with FTS5 full-text search:
- File reads, edits, writes
- Git operations (checkout, commit, push, diff, status)
- Tasks created, updated, completed
- Tool failures and non-zero exit codes
- User decisions and corrections (“use X instead”, “don’t do Y”)
- Environment changes (working directory, active virtualenv)
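A hypothetical sketch of how such a tracker might be wired up – the table names, event kinds, and schema here are illustrative assumptions, not Context Mode's actual implementation:

```python
# Sketch of a session knowledge base: an events table mirrored into an
# FTS5 index so every tracked event is full-text searchable later.
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE events (
    id INTEGER PRIMARY KEY,
    kind TEXT NOT NULL,      -- file_edit, git_op, task, failure, decision
    detail TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE VIRTUAL TABLE events_fts
    USING fts5(detail, content='events', content_rowid='id');
""")

def track(kind, detail):
    # External-content FTS5 tables must be kept in sync manually.
    cur = db.execute("INSERT INTO events (kind, detail) VALUES (?, ?)",
                     (kind, detail))
    db.execute("INSERT INTO events_fts (rowid, detail) VALUES (?, ?)",
               (cur.lastrowid, detail))

track("file_edit", "edited src/auth/login.py to fix the token refresh bug")
track("decision", "user said: use httpx instead of requests")

rows = db.execute(
    "SELECT e.kind, e.detail FROM events_fts f "
    "JOIN events e ON e.id = f.rowid WHERE events_fts MATCH ?",
    ("httpx",),
).fetchall()
print(rows)
```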
When compaction fires, a PreCompact hook builds a priority-tiered XML snapshot under 2 KB. Critical state – active files, pending tasks, user decisions, project rules – is always preserved. Lower-priority data (MCP tool counts, intent classification) is dropped first if the budget is tight.
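The tiering logic can be sketched like this – the section names and the budget handling are illustrative assumptions, not Context Mode's implementation:

```python
# Sketch of priority-tiered snapshotting: emit XML sections in priority
# order and stop once the byte budget is exhausted, so critical state
# always survives and low-priority bulk is dropped first.
from xml.sax.saxutils import escape

def build_snapshot(sections, budget=2048):
    """sections: (priority, tag, text) tuples; lower priority = more critical."""
    out = ["<snapshot>"]
    used = len("<snapshot></snapshot>")
    for _, tag, text in sorted(sections, key=lambda s: s[0]):
        piece = f"<{tag}>{escape(text)}</{tag}>"
        if used + len(piece) > budget:
            break  # budget hit: everything less critical is dropped
        out.append(piece)
        used += len(piece)
    out.append("</snapshot>")
    return "".join(out)

snap = build_snapshot([
    (0, "active_files", "src/auth/login.py"),
    (0, "pending_tasks", "fix token refresh; add retry test"),
    (1, "decisions", "use httpx instead of requests"),
    (9, "tool_counts", "x" * 4000),  # low-priority bulk, dropped first
])
print(len(snap) <= 2048)  # True
```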
After compaction, a SessionStart hook restores the snapshot and builds a Session Guide: structured sections covering the last user request, task status, key decisions, modified files, unresolved errors, git operations. The model continues from your last prompt with its working state intact. No re-prompting. No “what were we doing?”
The FTS5 implementation is worth understanding. BM25 ranking with Porter stemming at index time means “running” and “runs” both reduce to the stem “run”, so either form matches a query for the other. Three-layer search fallback handles typos and partial terms. It is not summarisation – when you retrieve indexed content, you get the actual code blocks and their heading hierarchy, not approximations. Relevance retrieval rather than context dumping means the agent sees what it needs, not everything that happened.
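The stemming behaviour is easy to demonstrate with SQLite's built-in porter tokenizer, which is what FTS5 exposes for this:

```python
# The porter tokenizer stems both indexed text and query terms, so a
# query for 'run' matches rows containing 'running' and 'runs'.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE notes USING fts5(body, tokenize='porter')")
db.executemany("INSERT INTO notes (body) VALUES (?)", [
    ("running the integration tests",),
    ("runs on every push",),
    ("unrelated note about deployment",),
])

# ORDER BY rank sorts by BM25 relevance, best match first.
rows = db.execute(
    "SELECT body FROM notes WHERE notes MATCH 'run' ORDER BY rank"
).fetchall()
print(len(rows))  # 2
```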
The Hooks Insight
There is a number buried in the Context Mode documentation that deserves more attention: without hooks, context savings compliance is around 60%. With hooks, it is around 98%.
The 40-percentage-point gap is the difference between “I added routing instructions to my CLAUDE.md” and “those instructions are enforced at the infrastructure layer.”
This is a pattern that comes up repeatedly in systems engineering. You cannot rely on a process to follow instructions consistently when the instructions are advisory and the process has better things to think about. The model, left to its own devices, will sometimes run a raw curl, read a large file directly, or dump unprocessed output into context – not because it is ignoring instructions, but because in the moment, the raw approach is the obvious one and the CLAUDE.md instruction is one of many competing signals.
Hooks change this. The PreToolUse hook intercepts every Bash, Read, WebFetch, Grep, and Task call before it executes. Dangerous commands are blocked and redirected to the sandbox. Routing guidance is injected in real time. The model does not have to remember to comply. Compliance is structural.
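In Claude Code, hooks of this kind are configured in settings.json; the plugin writes this configuration for you, but a minimal sketch of the shape (the matcher and script path here are illustrative) looks like:

```json
{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash|Read|WebFetch|Grep",
        "hooks": [
          {
            "type": "command",
            "command": "/usr/local/bin/ctx-route-check.sh"
          }
        ]
      }
    ]
  }
}
```

A hook command that exits with status 2 blocks the tool call and feeds its stderr back to the model as guidance – which is what makes the enforcement structural rather than advisory.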
This is also why Codex CLI is a second-class citizen in the Context Mode ecosystem. Codex has no hook support – the PRs were closed without merge. The only enforcement mechanism is AGENTS.md instructions, which gets you to roughly 60% compliance. One unrouted Playwright snapshot (56 KB) wipes out an entire session’s worth of savings.
For Claude Code, hooks are automatic with the plugin install. For Gemini CLI and VS Code Copilot, they require a manual config step but are fully supported. For Codex CLI, you are working against the grain.
The lesson generalises: if you need consistent behaviour from an AI agent, write the enforcement into the infrastructure, not just the instructions. Instructions alone are not enough.
Install and What Changes
For Claude Code, two commands:
/plugin marketplace add mksglu/context-mode
/plugin install context-mode@context-mode
Restart Claude Code. The plugin installs the MCP server, configures PreToolUse and PostToolUse hooks, sets up session tracking, and creates a CLAUDE.md routing instructions file in your project root. You do not change how you work. The routing happens automatically.
If you want to try the tools without committing to the full plugin:
claude mcp add context-mode -- npx -y context-mode
This gives you the six sandbox tools without automatic routing – useful for evaluating before you commit.
After install, /context-mode:ctx-stats shows you context savings per tool, tokens consumed, and savings ratio. /context-mode:ctx-doctor diagnoses runtimes, hooks, FTS5, and versions. The diagnostics are specific enough to be useful, not just reassuring.
What actually changes in practice: your context window stops filling up at the 30-minute mark. Sessions that used to degrade now run for 3 hours. The 99% context remaining at 45 minutes versus 60% without Context Mode is not a marginal improvement – it changes the shape of what you can build in a single session. Tasks that previously required breaking up across sessions can now complete in one run, with the agent maintaining coherent state throughout.
The security model is also worth noting. Context Mode enforces whatever permission rules you have already configured in .claude/settings.json – if you block sudo, it is also blocked inside the sandbox. Zero additional setup if you have not configured permissions. Existing rules apply automatically if you have.
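For example, a deny rule in .claude/settings.json like the following (using Claude Code's permission-rule syntax) blocks sudo in the main session and, with Context Mode installed, inside the sandbox as well:

```json
{
  "permissions": {
    "deny": [
      "Bash(sudo:*)"
    ]
  }
}
```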
The Agent’s Ceiling
There is a different version of the context wall problem: the human one. The 4-hour ceiling in AI-assisted engineering work – where sessions degrade not because the agent runs out of tokens but because humans exhaust the cognitive capacity needed to keep an AI session productive – is partly about the effort of maintaining shared context with an agent that forgets.
Context Mode is the technical layer underneath that. It does not solve the human side. But it buys substantially more room on the agent side, which reduces the overhead that falls on the human.
30 minutes to 3 hours is not a minor improvement. It is the difference between a session that can complete a meaningful unit of work and one that cannot. The agent that compacts and forgets is a specific kind of frustrating: you have done the work of establishing context, and it is gone. Context Mode makes that loss recoverable.
The ceiling is real. This moves it.
Sources:
- Context Mode on GitHub (MIT licensed)
- mksg.lu/blog/context-mode – Mert Köseoğlu’s writeup