What happened: 1M tokens at standard pricing, no premium
On March 13, 2026, Anthropic announced that its 1 million token context window is now generally available for Claude Opus 4.6 and Claude Sonnet 4.6. The announcement includes something significant beyond the context size itself: standard pricing applies across the full window. A request using 900,000 tokens costs the same per-token rate as one using 9,000. The long-context premium that previously applied above 200K tokens is gone.
The media limits have also expanded substantially. Where the previous beta allowed 100 images or PDF pages per request, the GA version supports up to 600. The beta header requirement has been removed: requests over 200K tokens now work automatically. The 1M window is available today natively on the Claude Platform and through Amazon Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry.
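To make the header change concrete, here is a minimal sketch of what a long-context request looks like after GA. It assumes the `anthropic` Python SDK call shape (`client.messages.create(**kwargs)`); the model ID is an illustrative placeholder, and the old beta header value is elided in the comment.

```python
# Sketch: building a long-context request with no beta header.
# Assumes the kwargs shape accepted by the Anthropic Python SDK's
# client.messages.create(); the model ID below is a placeholder.
def build_request(document_text: str, question: str) -> dict:
    """Return kwargs for client.messages.create().

    Under the previous beta, requests above 200K input tokens needed
    an opt-in header, roughly:
        extra_headers={"anthropic-beta": "context-1m-..."}
    With GA, no header is required: the kwargs below are all you send,
    regardless of how large the document is.
    """
    return {
        "model": "claude-opus-4-6",  # placeholder model ID
        "max_tokens": 4096,
        "messages": [{
            "role": "user",
            "content": f"<document>\n{document_text}\n</document>\n\n{question}",
        }],
    }

kwargs = build_request("...full 600-page document text...",
                       "List every clause that cross-references another section.")
```

In a live integration you would pass these kwargs directly: `client.messages.create(**kwargs)`. The point is what is absent: no beta header, no special long-context flag.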
On benchmarks, Opus 4.6 scores 78.3% on MRCR v2 at the full 1M context length, which Anthropic reports is the highest among frontier models at that size. For Claude Code users on Max, Team, and Enterprise plans, the 1M window is now the default for Opus 4.6 sessions, meaning fewer compaction events and longer uninterrupted agent runs.
Why it matters: chunking was always a workaround, not a solution
For most enterprise document workflows, chunking was never the intended solution. It was a workaround for context limits. When you split a 200-page contract into overlapping chunks, you're accepting that the model will never see the full document at once. Cross-references between clauses on page 4 and page 187 require careful retrieval design. Contradictions across sections are easy to miss. The context window was the bottleneck, and engineers built elaborate RAG pipelines around it.
600 pages in a single request changes that calculus. A full set of quarterly financial reports. An entire due diligence bundle. A complete tender submission with appendices. A case file with depositions and exhibits. All of these fit comfortably within 600 pages, and all of them can now be processed in a single model call without any chunking overhead. The reasoning is over the full document, not a retrieval-approximated fragment of it.
For AI agents running multi-step workflows, the impact is equally significant. Production agents accumulate context: tool call outputs, intermediate reasoning, database query results, API responses. At 200K tokens, compaction was a constant pressure. Details from early in the workflow would get summarized away, leading to agents that "forgot" critical information and required redundant re-fetching. At 1M tokens with no premium, a complex agent can run for hours without losing its working memory, and the economics of doing so are the same as a simple single-turn request.
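The compaction pressure described above can be illustrated with a toy accumulator. The character-based token estimate and the "compaction frees 80% of the window" assumption are ours, purely for illustration; a real agent would read usage metadata from the API.

```python
# Sketch: how window size changes compaction pressure for an agent.
# The 4-chars-per-token estimate and the 80%-freed compaction model
# are illustrative assumptions, not measured behavior.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # crude heuristic

class AgentContext:
    def __init__(self, window: int):
        self.window = window
        self.used = 0
        self.compactions = 0

    def append(self, text: str) -> None:
        tokens = estimate_tokens(text)
        if self.used + tokens > self.window:
            # Summarize-and-drop: assume compaction keeps ~20% of the window,
            # discarding details from early in the workflow.
            self.used = int(self.window * 0.2)
            self.compactions += 1
        self.used += tokens

# 50 tool-call outputs of ~5K tokens each: ~250K tokens of working memory.
small, large = AgentContext(200_000), AgentContext(1_000_000)
for _ in range(50):
    small.append("x" * 20_000)
    large.append("x" * 20_000)
```

With a 200K window the toy agent is forced to compact mid-run; with 1M it never does, which is the "fewer compaction events" behavior the announcement describes for Claude Code.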
Laava's perspective: full-document reasoning was always the goal
At Laava, we've been building document processing agents since before context windows made it easy. We developed chunking strategies, overlap parameters, and retrieval pipelines precisely because we had to. Our Context Layer, the metadata and structuring work that underpins every system we build, was partly designed to compensate for the limits of what a model could hold at once.
The 1M context window at standard pricing does not make that metadata work obsolete. A model can hold a large document in context and still reason poorly about it if the document lacks structure, version tagging, or authority metadata. What changes is the trade-off between retrieval precision and full-document comprehension. For certain document types, particularly long contracts, regulatory filings, and complex case files, it is now often better to load the full document directly than to retrieve relevant chunks. The RAG pipeline becomes a fallback rather than the primary strategy.
This is also a pricing story. For clients running high-volume document workflows, the previous long-context premium created pressure to keep context short, which meant chunking more aggressively, which meant lower accuracy. Removing that premium breaks the cost-accuracy trade-off. You can now design for accuracy first and not pay a surcharge for it. For the 4-week pilots we run with clients, this directly shortens the time needed to get from raw documents to production-grade extraction, because we no longer need to tune retrieval pipelines before the agent can reason across a full document.
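The cost effect is easy to work through with a tiered-pricing model. The dollar rates below are hypothetical placeholders, not published prices; only the shape of the calculation (a surcharge above 200K input tokens versus a flat rate) reflects the change described above.

```python
# Sketch: cost of a long-context request before and after the premium removal.
# BASE_RATE and PREMIUM_RATE are hypothetical, for illustration only.
BASE_RATE = 3.00      # $ per million input tokens (assumed)
PREMIUM_RATE = 6.00   # $ per million input tokens above 200K (assumed, old model)

def old_cost(input_tokens: int) -> float:
    """Tiered pricing: the premium applied only to tokens beyond 200K."""
    standard = min(input_tokens, 200_000)
    premium = max(0, input_tokens - 200_000)
    return (standard * BASE_RATE + premium * PREMIUM_RATE) / 1_000_000

def new_cost(input_tokens: int) -> float:
    """Flat pricing across the full 1M window."""
    return input_tokens * BASE_RATE / 1_000_000
```

Under these assumed rates, a 900K-token request drops from $4.80 to $2.70, and the per-token rate no longer depends on how much context you use, which is exactly why the incentive to chunk aggressively disappears.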
What you can do now
If you're running document-heavy AI workflows today, revisit your chunking strategy. For documents under 600 pages, full-document loading is now a viable option worth benchmarking against your current retrieval approach. You may find that accuracy improves and architectural complexity drops at the same time. If you're using the Claude API with a beta header for long-context requests, you can remove it: it is now ignored, and the 1M window activates automatically.
If you're evaluating AI agents for document processing and have been told that RAG complexity is unavoidable, that assessment may be outdated. The architectures that made sense at 100K token limits are worth revisiting. Laava helps organizations design document automation that fits the current state of the technology, not last year's workarounds. If you have a document-heavy process you've been hesitant to automate because the accuracy requirements seemed too high, now is a good time to look again.