Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New

Anthropic announced on March 13, 2026 that the 1M token context window is generally available for Claude Opus 4.6 and Sonnet 4.6, at standard pricing with no long-context premium. This removes the beta header requirement, raises media limits to 600 images or PDF pages per request, and extends the capability to Claude Code Max, Team, and Enterprise users automatically.


Changelog

Date           Summary
14 Mar 2026    Initial publication.

If you’ve run a long Claude Code debugging session, you know the moment: the model summarises what it knew before, and something important disappears into the compression. A function signature you explained twenty minutes ago. The specific error pattern you established early on. The cross-file dependency the agent had correctly identified before context pressure kicked in. You spend the next stretch re-establishing ground you’d already covered.

That’s the compaction problem. And it’s what 1M context actually solves.

What GA Means

The headline numbers: Claude Opus 4.6 at $5/$25 per million tokens (input/output), Sonnet 4.6 at $3/$15. Those rates apply uniformly across the full context window. A 900K-token request costs the same per token as a 9K one. No multiplier, no premium tier, no special pricing for long context.
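The flat-pricing claim is easy to sanity-check. A minimal sketch, using the rates quoted above (the `request_cost` helper and the model keys are illustrative, not part of any SDK):

```python
# Per-million-token rates from the GA announcement: (input, output) in USD.
RATES = {
    "claude-opus-4-6": (5.00, 25.00),
    "claude-sonnet-4-6": (3.00, 15.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Flat pricing: the same per-token rate applies at any context length."""
    in_rate, out_rate = RATES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# Input cost scales linearly -- a 900K-token prompt is exactly 100x a 9K one.
small = request_cost("claude-opus-4-6", 9_000, 1_000)    # $0.045 in + $0.025 out
large = request_cost("claude-opus-4-6", 900_000, 1_000)  # $4.50 in + $0.025 out
```

No branch on context length anywhere: that is what "no multiplier, no premium tier" means in code.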

What else changes with general availability:

  • No beta header required. If you were sending the anthropic-beta: long-context-window-2024-11-05 header, it’s now silently ignored. Requests over 200K tokens work automatically. Zero code changes needed.
  • Full rate limits at every context length. Your standard account throughput applies across the whole window. Previously, requests over 200K tokens drew from a separate long-context rate-limit pool.
  • Media up to 600 images or PDF pages per request, up from 100.
  • Claude Code included. Max, Team, and Enterprise users get 1M context with Opus 4.6 automatically. Previously it required extra usage allocation.
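In code, the new ceilings reduce to two numbers. A hypothetical pre-flight check (the limits are from the announcement; the `preflight` function and its names are illustrative, not part of the Anthropic SDK):

```python
CONTEXT_WINDOW_TOKENS = 1_000_000   # GA window for Opus 4.6 / Sonnet 4.6
MAX_MEDIA_ITEMS = 600               # images or PDF pages per request (was 100)

def preflight(prompt_tokens: int, media_items: int = 0) -> None:
    """Fail locally before sending a request the API would reject anyway."""
    if prompt_tokens > CONTEXT_WINDOW_TOKENS:
        raise ValueError(
            f"{prompt_tokens} tokens exceeds the {CONTEXT_WINDOW_TOKENS} window"
        )
    if media_items > MAX_MEDIA_ITEMS:
        raise ValueError(
            f"{media_items} media items exceeds the {MAX_MEDIA_ITEMS} limit"
        )

preflight(900_000, media_items=600)  # within GA limits, no beta header needed
```

The notable absence: there is no header logic to maintain. A request over 200K tokens is just a request.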

Available on Claude Platform directly, Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry.

The Compaction Problem Is the Real Story

The 1M token number is easy to anchor on, but it’s the wrong thing to focus on. Most engineers who hit context limits weren’t frustrated by hitting 200K – they were frustrated by what happened next.

Compaction is lossy. When Claude runs out of context headroom and summarises earlier conversation to make space, you don’t get a clean abstract of what was removed. You get an approximation. Subtle details drop. Cross-file dependency graphs that the model had built up internally get flattened. The model continues working, but on a degraded version of the context you painstakingly established.

The practical effect is that long-context work collapses into repeated-short-context work with extra overhead. You’re not reasoning across a large problem space – you’re reasoning across chunks, losing fidelity at each boundary, and spending tokens re-establishing what you’d already established.

Anton Biryukov, a software engineer using Claude Code in production, puts it directly: “Claude Code can burn 100K+ tokens searching Datadog, Braintrust, databases, and source code. Then compaction kicks in. Details vanish. You’re debugging in circles. With 1M context, I search, re-search, aggregate edge cases, and propose fixes – all in one window.”

The 15% decrease in compaction events that Jon Bell (CPO) reports isn’t a cosmetic improvement. It represents sessions that stayed coherent instead of degrading midway through.

What This Actually Unlocks

Some use cases are newly feasible – not because of the number itself, but because the window is now large enough to hold entire problem spaces in a single pass.

Multi-file code review without chunking. Adhyyan Sekhsaria from the Devin team describes the before state: “Large diffs didn’t fit in a 200K context window so the agent had to chunk context, leading to more passes and loss of cross-file dependencies.” With the full diff in a single window, the agent can reason about changes holistically. You get higher-quality reviews from a simpler harness – fewer passes, no boundary artifacts, no context-switching overhead.

Full incident response traces. Production incidents generate enormous context: every log query, every tool call, every hypothesis tested, every dead end. With 200K, you were either truncating the trace or chunking the analysis. At 1M, you can keep the full incident in view from first alert to remediation. Mayank Agarwal (Founder, CTC) notes they can now “keep every entity, signal, and working theory in view from first alert to remediation without having to repeatedly compact or compromise the nuances of these systems.”

Legal and contract analysis across a full negotiation history. Bringing five rounds of a 100-page partnership agreement into one session is now practical. The model can see the full arc of what changed and why – which is precisely what cross-document legal reasoning requires.

Scientific literature synthesis. Reasoning across hundreds of papers, proofs, and codebases simultaneously. Not sequentially, not with retrieval in between – in a single pass. For domains where connections between sources matter as much as individual sources, this is a qualitatively different capability.

Codebase-as-context. One million tokens holds a large monorepo – 100K to 500K lines depending on verbosity. That’s enough to load working context rather than using retrieval. For certain analysis tasks, having the whole codebase in context beats RAG by eliminating retrieval latency and the quality loss from approximate nearest-neighbour search.
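The 100K-to-500K-line range follows from a rough tokens-per-line assumption. A back-of-envelope sketch (the 2-10 tokens/line bounds are an assumed density range for typical source code, not a measured figure):

```python
def lines_that_fit(window_tokens: int, tokens_per_line: float) -> int:
    """How many source lines a context window holds at a given token density."""
    return int(window_tokens / tokens_per_line)

WINDOW = 1_000_000
dense = lines_that_fit(WINDOW, 10)   # verbose code, long lines -> 100,000 lines
sparse = lines_that_fit(WINDOW, 2)   # terse code, short lines  -> 500,000 lines
```

For a real repository you would measure the density with a tokenizer rather than assume it, but the bounds explain where the article's range comes from.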

The performance claim matters here: Opus 4.6 scores 78.3% on MRCR v2 (multi-round co-reference resolution) at 1M context, the highest among frontier models at that context length. A million tokens of context is only useful if the model can actually retrieve and reason across what’s in it. The benchmark suggests it can.

When RAG Is Still the Right Answer

This is the part that gets glossed over in announcements.

1M context doesn’t obsolete retrieval-augmented generation. The architectural question is specific: is the thing you’re working with static enough, bounded enough, and important enough to put in context – or is it dynamic, large, or ephemeral enough that retrieval makes more sense?

Dynamic corpora. A knowledge base that updates continuously. Customer support documentation that changes weekly. The context window is a snapshot; RAG retrieves fresh content. If staleness matters, retrieval wins.

Cost at scale. Nine hundred thousand tokens at $5/million is $4.50 per request input cost, before output. For high-volume production systems processing thousands of requests, that’s meaningful. A well-optimised RAG pipeline over a vector store can be significantly cheaper per query if the retrieval quality is sufficient for your use case.
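The arithmetic behind that comparison, as a sketch. The full-context side uses the $5/million Opus input rate from above; the RAG-side figures – retrieved-chunk size and per-query vector-store cost – are illustrative assumptions:

```python
OPUS_INPUT_RATE = 5.00 / 1_000_000  # USD per input token, from the GA pricing

def full_context_cost(context_tokens: int) -> float:
    """Input cost of loading the whole corpus into the window, per request."""
    return context_tokens * OPUS_INPUT_RATE

def rag_cost(retrieved_tokens: int, retrieval_overhead_usd: float) -> float:
    """Only retrieved chunks hit the model; add assumed vector-store query cost."""
    return retrieved_tokens * OPUS_INPUT_RATE + retrieval_overhead_usd

per_query_full = full_context_cost(900_000)  # $4.50, as in the text
per_query_rag = rag_cost(8_000, 0.001)       # ~$0.041 with the assumed figures
monthly_delta = (per_query_full - per_query_rag) * 10_000  # at 10K queries/month
```

At assumed production volumes the gap is tens of thousands of dollars a month, which is why "cost at scale" stays a real criterion even with flat pricing.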

Latency. Large context requests take longer to process. For real-time applications where response time matters, retrieval can deliver relevant context faster than loading the full window.

Scale beyond 1M. If your corpus genuinely exceeds 1M tokens, retrieval is still the only option. The window is large, but it’s not infinite.
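The four criteria above can be folded into a single decision sketch. The thresholds here are illustrative judgment calls, not rules from the announcement:

```python
from dataclasses import dataclass

@dataclass
class Workload:
    corpus_tokens: int       # total size of what the model needs to see
    updates_per_day: float   # how fast the corpus changes
    queries_per_day: int     # request volume
    latency_sensitive: bool  # real-time, user-facing path?

def prefer_rag(w: Workload) -> bool:
    """True when retrieval still wins; False when full context is simpler."""
    if w.corpus_tokens > 1_000_000:  # scale beyond 1M: doesn't fit the window
        return True
    if w.updates_per_day > 1:        # dynamic corpora: the window is a snapshot
        return True
    if w.latency_sensitive:          # large prompts take longer to process
        return True
    return w.queries_per_day > 1_000  # cost at scale (illustrative threshold)

# A static 400K-token codebase, analysed a few times a day: load it whole.
```

The point of the sketch is what’s missing: there is no longer a pricing-tier branch in the decision.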

The build decision shifted. Previously, engineers often chose RAG specifically to avoid long-context premiums – not because retrieval was the right architecture, but because the economics forced it. That constraint is gone. The calculus is now purely about what the use case actually needs, rather than being distorted by pricing tiers.

For an extended take on when retrieval versus context is the right call in local inference deployments, see the AI self-hosting and model landscape overview, which maps where different models sit on the cost-capability curve.

What Changes for Claude Code Users Today

The compaction improvement is the immediate, practical headline for engineering teams. Fewer compaction events means:

  • Less context re-establishment overhead. You’re not spending tokens re-explaining what you already explained.
  • Better cross-file dependency tracking during refactors. The model’s internal map of the codebase stays intact across the session.
  • Longer uninterrupted coding sessions. Agent sessions that previously degraded after an hour can now run longer without losing fidelity.
  • More reliable multi-file diffs. The full change is in context; the agent doesn’t have to reason about partial views and stitch them together.

This is a developer experience improvement, not just a capability one. The theoretical context length always sounded impressive. The practical improvement is that the sessions you’re already running will be less likely to hit the wall.

For production reliability considerations – how to think about context window management in AI systems you’re deploying – see LLM acceptance criteria for production systems.

The Architectural Implication

One million tokens isn’t the destination. It’s the point where context size stops being the dominant architectural constraint for most applications.

The interesting question shifts. Previously, a significant portion of the design work in AI systems was managing context: what to include, what to summarise, what to retrieve, how to handle compaction gracefully. That work didn’t make the system better – it managed a limitation. With 1M context at flat pricing, those engineering hours redirect toward what the system is actually trying to do.

The constraint that shaped a lot of AI system architecture for the past two years just got substantially weaker. The question now is what you build differently when context is abundant.