This post covers Gemini 3.1 Pro as of its February 19, 2026 preview launch. Pricing, availability, and benchmark positions may shift before general availability.
What’s New This Week
No material updates since the February 19 launch. The model remains in preview. Benchmark rankings are current as of publication and will be updated as the model moves toward GA.
Changelog
| Date | Summary |
|---|---|
| 20 Mar 2026 | Initial publication. |
Google released Gemini 3.1 Pro on February 19 in developer preview. The headline numbers: 77.1% on ARC-AGI-2, more than double the previous version, and #1 out of 123 models on the Artificial Analysis Intelligence Index. Both numbers are real. Both need context before you route production traffic to them.
What ARC-AGI-2 actually measures
ARC-AGI-1 is effectively solved. Frontier models now score in the high 80s to 90s, and the benchmark has stopped being informative. François Chollet designed ARC-AGI-2 specifically to resist that saturation – it tests a model’s ability to solve entirely novel logic patterns, the kind that can’t be brute-forced through training data memorisation.
The framing matters: this isn’t a benchmark that measures “how much does the model know.” It measures whether the model can reason through something it hasn’t seen before. That’s a harder property to fake.
If 77.1% on ARC-AGI-2 is “more than double” Gemini 3 Pro’s score, the previous version must have landed below 38.6% – likely somewhere in the mid-30s. That’s not a marginal improvement from better training data or fine-tuning. That’s a genuine shift in reasoning capability, the kind that doesn’t happen every release cycle.
For comparison: GPT-5 and Claude Opus 4.6 are competitive in this range, but 3.1 Pro reaching 77.1% puts it solidly at the frontier. Whether that translates to better outputs on your specific use case is a separate question – ARC-AGI-2 performance is a strong signal, not a guarantee.
#1 on AAII – and the verbosity problem
The Artificial Analysis Intelligence Index aggregates performance across a broad set of tasks and ranks models on overall intelligence quality. Gemini 3.1 Pro scored 57 against a field average of 31, topping 123 models. That’s a substantial margin.
The caveat is sitting in the same data: Gemini 3.1 Pro generated 57 million tokens during AAII evaluation, against an average of 13 million. It’s roughly 4x more verbose than the typical frontier model for equivalent quality output.
At $12 per million output tokens, that verbosity gap is not theoretical. If you’re building an API-heavy workflow and you model costs against the headline price, you’ll undershoot the real number. A task that costs $0.60 with an average model at similar quality could cost $2.40 with 3.1 Pro if the verbosity pattern holds in your use case. Some tasks will be fine. High-volume production workflows need actual cost profiling before committing.
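The arithmetic above can be sketched directly. This is a back-of-envelope cost model, not an official calculator: the 50K/200K token counts are illustrative assumptions standing in for “an average model” versus “a 4x-more-verbose model” on the same task, and the $12/M figure is the headline output price quoted above.

```python
# Back-of-envelope output-cost comparison for models with different
# verbosity. All token counts below are illustrative assumptions.

def output_cost(tokens_millions: float, price_per_million: float) -> float:
    """Output-token cost in dollars."""
    return tokens_millions * price_per_million

# Hypothetical task: an average-verbosity model emits ~50K output tokens;
# a 4x-more-verbose model emits ~200K for equivalent-quality output.
avg_cost = output_cost(0.05, 12.0)      # 50K tokens at $12/M
verbose_cost = output_cost(0.20, 12.0)  # 200K tokens at $12/M

print(f"average-verbosity cost: ${avg_cost:.2f}")   # $0.60
print(f"4x-verbose cost:        ${verbose_cost:.2f}")  # $2.40
```

The point of modelling it this way is that verbosity is a per-model multiplier on output tokens, so it scales your whole bill, not just the edge cases.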
This isn’t a reason to rule it out. It is a reason to measure rather than assume.
The 1M context window field
Context window parity has arrived at the frontier. Claude Opus 4.6, Claude Sonnet 4.6, and Gemini 3.1 Pro all offer 1M token context. GPT-5 is at 128K. The long-context frontier is Google and Anthropic. OpenAI hasn’t closed that gap yet.
For practical use, 1M tokens means you can pass in entire codebases, lengthy document sets, or extended conversation history without chunking strategies. It also means the model can synthesise across large input sets rather than requiring you to summarise and inject context manually. That’s a real capability difference for the use cases it applies to.
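A quick sanity check before relying on “no chunking needed” might look like this. It is a rough sketch: the ~4 characters-per-token ratio is a common heuristic for English text, not an exact tokenizer, and the 64K output reserve mirrors the output limit mentioned below.

```python
# Rough check for whether a document set fits in a 1M-token context
# window. The 4 chars/token ratio is a heuristic, not a tokenizer.

CONTEXT_LIMIT = 1_000_000
CHARS_PER_TOKEN = 4  # rough assumption for English prose/code

def estimated_tokens(texts: list[str]) -> int:
    return sum(len(t) for t in texts) // CHARS_PER_TOKEN

def fits_in_context(texts: list[str], reserve: int = 64_000) -> bool:
    # Reserve headroom for the system prompt and the output budget.
    return estimated_tokens(texts) <= CONTEXT_LIMIT - reserve

docs = ["x" * 400_000, "y" * 1_200_000]  # ~100K + ~300K tokens
print(estimated_tokens(docs))  # 400000
print(fits_in_context(docs))   # True
```

For anything near the limit, count tokens with the provider’s actual tokenizer rather than a character heuristic.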
3.1 Pro also has a 64K output limit – generous relative to most frontier models. Combined with 1M context, the model is well-positioned for tasks that require reading a lot and writing a lot.
Speed as a real differentiator
120 tokens per second is fast for a frontier-tier model. Claude Sonnet 4.6 runs at 80-100 t/s at peak; Claude Opus is slower. At this quality tier, the speed difference matters.
For streaming UX, faster generation is directly user-visible. For agentic workflows that make multiple sequential model calls, latency compounds. A 40% throughput advantage across an agent loop with 10 LLM calls isn’t noise – it’s the difference between a responsive tool and one that feels slow.
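The compounding is easy to make concrete. The sketch below assumes a purely sequential 10-call loop emitting 1,000 output tokens per call, ignoring network and tool latency; the 120 and 85 t/s figures come from the throughput numbers above.

```python
# How throughput differences compound across a sequential agent loop.
# Call count and tokens-per-call are illustrative assumptions.

def loop_seconds(calls: int, tokens_per_call: int, tokens_per_sec: float) -> float:
    """Total generation time for sequential model calls (generation only)."""
    return calls * tokens_per_call / tokens_per_sec

# 10-call agent loop, 1,000 output tokens per call.
fast = loop_seconds(10, 1000, 120.0)  # ~83.3s
slow = loop_seconds(10, 1000, 85.0)   # ~117.6s
print(f"120 t/s: {fast:.1f}s vs 85 t/s: {slow:.1f}s")
```

A half-minute gap per loop is exactly the kind of difference users feel in an interactive tool, and it grows linearly with call count.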
The 3.1 Pro speed-quality combination – #1 intelligence index, 120 t/s – is unusual. Typically you trade speed for quality at the frontier. Google has managed to maintain both, at least in preview. Whether that holds under production load remains to be validated.
The Google ecosystem distribution play
Gemini 3.1 Pro is available in: the Gemini API via Google AI Studio, Gemini CLI, Google Antigravity (their new agentic development platform), Android Studio, Vertex AI, Gemini Enterprise, the Gemini app, and NotebookLM.
This is the same distribution play OpenAI made with Codex and Astral – get the model into everywhere developers already work, not just into the API. The quality matters; the distribution amplifies it.
Google Antigravity is worth watching as the agentic platform angle develops. It sits alongside tools like the Gemini CLI as Google’s answer to the wave of AI-native developer tooling. Gemini 3.1 Pro as the default intelligence in that stack is a strategic position, not just a product launch.
For enterprise, the Vertex AI availability matters. Managed deployment, billing via existing GCP relationships, and enterprise SLAs are the practical reasons organisations choose Vertex over direct API access. Google has made sure 3.1 Pro is available there from day one of preview.
Where 3.1 Pro fits in the practical developer choice
The frontier model comparison as of March 2026:
Claude Opus 4.6 – strong complex reasoning, 1M context, $5/$25 per million input/output. Expensive. Still best-in-class for specific reasoning-heavy tasks where quality is the only variable that matters.
Claude Sonnet 4.6 – faster, $3/$15, 1M context. The practical workhorse for most production use cases. See the Claude Code platform post for context on the Anthropic ecosystem.
Gemini 3.1 Pro – #1 AAII, 77.1% ARC-AGI-2, 1M context, $2/$12, 120 t/s. The most cost-competitive option at this intelligence tier, but verbosity means real costs vary. Preview status means you’re not building on a GA commitment yet.
GPT-5 – strong on coding and writing, 128K context. Capable, but the context window ceiling is a real constraint for document-heavy and codebase-scale tasks.
There is no single winner across all tasks. Gemini 3.1 Pro’s value proposition is clearest for reasoning-heavy tasks where you want frontier-level intelligence at a competitive price and can tolerate preview constraints. For high-volume production, verbosity modelling is required before committing. For latency-sensitive agentic workflows, the 120 t/s throughput is a genuine advantage.
The Chinese model field is also worth tracking here – MiniMax M2.7 and others are closing the gap in specific areas, which adds pressure on the Western frontier. Recent releases like Mistral Small 4 show the smaller-model tier is also moving fast.
The reasoning capability in Gemini 3.1 Pro is real. The ARC-AGI-2 jump is not a benchmark gaming story – it reflects genuine progress on a test designed to be hard to fake. The verbosity requires cost modelling before headline prices apply to your workload. The preview status means you’re validating, not committing.
At the frontier, the question is never which model is best in the abstract. It’s which model is best for the specific task, volume, and cost tolerance you’re working with. Gemini 3.1 Pro has earned a place in that evaluation.