Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New: 7 March 2026
OpenAI shipped GPT-5.4 on March 5, billed as its “most capable and efficient frontier model for professional work.” The headline feature is native computer use – the model can operate across your desktop and applications without a separate plugin layer. It ships in two variants: GPT-5.4 Pro (high performance) and GPT-5.4 Thinking (reasoning-focused), both with a 1.05M token context window. Input tokens beyond 272K are billed at 2x the standard rate, so watch your costs on long-context workloads. First impressions from the engineering community are mixed – the capability step is real, but so is the price. It is too early for a full comparison with Claude Sonnet 4.6 and Gemini 3.1 Pro; expect clearer benchmarks within the next two weeks.
Changelog
| Date | Summary |
|---|---|
| 7 Mar 2026 | Initial publication. GPT-5.4 added to OpenAI section. |
The “best AI model” question no longer has a single answer. It has four.
Best for coding. Best for cost at scale. Best for self-hosted inference. Best for agentic workflows. These were roughly converging categories a year ago – the frontier model was usually the right answer for all of them. That’s no longer true. The model landscape in early 2026 is deeper, more fragmented, and more genuinely competitive than it has ever been. Chinese open-weights releases have changed the self-hosting calculus. Google has gotten genuinely competitive on long-context. OpenAI has shipped a unified frontier model that folds reasoning in by default.
This post maps the terrain. It is opinionated. The goal is to tell you what to actually use.
Section 1: Anthropic (Claude)
Anthropic runs the tightest product line of the major labs – three tiers, clearly differentiated, with a consistent quality floor that other providers still struggle to match on coding and instruction-following.
Claude Haiku 4.5
The cost-performance sweet spot for high-volume tasks. At $1 per million input tokens and $5 per million output tokens, it competes directly with GPT-4o mini and Gemini 2.0 Flash – and holds its own. Where Haiku earns its place is consistency: it rarely produces the kind of confidently wrong output that haunts cheaper alternatives. For classification, summarisation, fast code completions, and routing in multi-agent pipelines, this is the model you run at scale.
Claude Sonnet 4.5 / 4.6
The workhorse. $3 per million input tokens, $15 per million output tokens. SWE-bench Verified: 72.7% on Sonnet 4.6 – the highest score from any model on that benchmark at the time of its release. This is the number that matters for engineering teams: SWE-bench tests real repository-scale coding tasks, not synthetic benchmarks.
The jump from Claude 3.7 Sonnet (62.3% SWE-bench) to Sonnet 4.6 was significant. Extended thinking, introduced with Claude 3.7, persists into the 4.x generation: the model can visibly reason through complex problems before responding. Turning extended thinking on materially helps with multi-file refactors, debugging unfamiliar codebases, and architectural reasoning. It costs more and takes longer – use it selectively, not by default.
Context window: 200K standard, 1M for specific use cases (requests over 200K input tokens carry a premium long-context rate). For most engineering workloads, 200K is sufficient.
Sonnet is the default recommendation for coding assistance, agent tasks, and long-context analysis. If you’re running Claude Code or Cursor backed by an API, this is what you want.
Claude Opus 4.5 / 4.6
The top of the range. $5 per million input tokens, $25 per million output tokens – a significant drop from the original Opus 4 ($15/$75), which makes it more practically deployable.
SWE-bench: 72.5% on Opus 4.6 in standard configuration, with claims of 80.8% in extended modes. On knowledge work evaluated by Elo, Opus has a substantial advantage over Sonnet for tasks requiring judgment, complex reasoning across long documents, and interpreting vague or underspecified requirements.
When is the cost premium worth it? Genuinely hard problems: multi-step code reviews across large codebases, security audit tasks, architectural decisions where the model needs to hold a lot of context and reason carefully. For routine coding, Sonnet is the better call – Opus’s advantage is in the tail of hard problems, not average-case performance.
Anthropic’s tooling story is strong: function calling and structured output are reliable, which matters for agent architectures where a wrong tool call cascades. This is a differentiator worth pricing in.
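The schema shape Anthropic's Messages API expects for a tool is a name, a description, and a JSON Schema under `input_schema`. A minimal sketch – the `lookup_ticket` tool itself is a hypothetical example, not from the post:

```python
# Illustrative tool definition in the shape Anthropic's Messages API expects.
# The tool ("lookup_ticket") and its fields are hypothetical.
lookup_ticket_tool = {
    "name": "lookup_ticket",
    "description": "Fetch a support ticket by ID and return its status and summary.",
    "input_schema": {
        "type": "object",
        "properties": {
            "ticket_id": {
                "type": "string",
                "description": "Ticket identifier, e.g. 'T-1234'",
            },
        },
        "required": ["ticket_id"],
    },
}

# Passed via the tools parameter:
#   client.messages.create(model=..., max_tokens=1024,
#                          tools=[lookup_ticket_tool], messages=[...])
# When the model decides to call the tool, the response contains content
# blocks with type "tool_use" carrying the structured arguments.
```

The point of the reliability claim is that a malformed or hallucinated `tool_use` block here cascades through the rest of the agent loop – which is why it is worth testing on your own schemas.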
Section 2: OpenAI
OpenAI has had a complicated few months. The product line expanded faster than most teams could track, several models have already been retired from ChatGPT (GPT-4o, GPT-4.1, o4-mini), and the pricing trajectory has been inconsistent. The just-released GPT-5.4 looks like an attempt to consolidate – one unified frontier model instead of a sprawling o-series vs GPT-series split.
GPT-4o and the Baseline
GPT-4o remains the model engineering teams have the deepest experience with. It is now being retired from ChatGPT but remains available via API. Its main appeal was the combination of multimodal capability (vision, voice) with competitive coding performance. It is still a good choice for teams with existing integrations they don’t want to rebuild. For new builds in 2026, Sonnet 4.6 at comparable cost is the cleaner choice.
The o-Series: Reasoning Models
The o-series (o1, o3, o3-mini, o4-mini) was OpenAI’s answer to the observation that scaling compute at inference time – rather than just training time – produces measurable gains on hard problems. The results are real: o3-mini consistently achieves maximum clarity scores on technical tasks at a higher rate than GPT-4o, and o4-mini topped LongMemEval benchmarks on reasoning accuracy.
When to use an o-series model: math-heavy problems, formal verification tasks, complex multi-step reasoning where you need the model to work through a problem carefully before responding. When not to: fast iteration, high-volume tasks, anything latency-sensitive. Reasoning effort is configurable (none, low, medium, high, xhigh) but more effort means more latency and cost. Don’t default to xhigh and expect it to be cheap.
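Dialing effort per request rather than per model is the practical pattern. A sketch of the request shape – the parameter name follows OpenAI's reasoning-model API (`reasoning_effort`); verify the exact level names your account and model support:

```python
# Sketch: route easy queries to low reasoning effort, hard ones to high.
# Parameter name per OpenAI's reasoning-model API (reasoning_effort);
# the available levels vary by model, so treat these as placeholders.
def build_request(prompt: str, hard: bool) -> dict:
    """Build a chat-completions payload with effort chosen per task."""
    return {
        "model": "o3-mini",
        "reasoning_effort": "high" if hard else "low",
        "messages": [{"role": "user", "content": prompt}],
    }

# Same endpoint, very different latency and cost profiles:
easy = build_request("Rename this variable consistently", hard=False)
hard = build_request("Prove this invariant holds across the retry loop", hard=True)
```

The asymmetry matters: defaulting everything to high effort quietly multiplies both latency and output-token spend.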
GPT-5.4
Released March 5, 2026. OpenAI calls it its “most capable and efficient frontier model for professional work.” The headline features:
- Native computer use (no plugin layer required – the model can operate your desktop and work across applications)
- 1.05M token context window
- Reasoning effort configurable in-model (folds in the o-series capability without a separate model)
- GPT-5.4 Pro and GPT-5.4 Thinking variants
The unified model approach is the right architectural call. Having a single endpoint where you dial reasoning effort rather than picking between GPT-4o and o3-mini reduces integration complexity. First-week impressions from engineering teams are positive on quality but cautious on pricing at high context lengths.
OpenAI’s API reliability track record is the best in the industry for enterprise teams that need SLA guarantees. If your organisation is already deep in the OpenAI ecosystem and needs enterprise support contracts, GPT-5.4 is the defensible choice.
Section 3: Google (Gemini)
Google’s story in early 2026 is better than the narrative gives it credit for. Gemini has been consistently underrated by engineering teams who tried an early version and wrote it off.
Gemini 2.0 Flash
The most underrated model in the current landscape for cost-sensitive production use. Sub-second response times (0.21-0.37s), 15x lower cost than Gemini Pro, and higher rate limits. For high-volume API use where latency matters – routing, classification, generation at scale – Flash is worth serious evaluation against Claude Haiku and GPT-4o mini. Flash 2.5 Lite is reportedly 1.5x faster than the 2.0 generation at lower cost.
Gemini 2.5 Pro
Competitive with mid-tier Claude and GPT. The quality gap between Flash and Pro is smaller than the marketing suggests (benchmark averages were 64.3% for Flash vs 66.9% for Pro in side-by-side evaluations). The cases where Pro justifiably wins: complex reasoning over long documents, multi-turn tasks with rich context, and Google Workspace/Cloud integration where native connectivity matters.
Gemini 3.1 Pro
Google’s current frontier model (as of late February 2026). Benchmarks put it in competition with Claude Opus 4.6 and Sonnet 4.6, with GPT-5.2 as the OpenAI reference point at the time of its release. Google’s multimodal story is strongest at this tier – native video, audio, and image understanding at frontier quality. The 1M+ context window is a genuine differentiator for teams doing document-heavy work: contract analysis, codebase understanding, or report synthesis across very large inputs.
Note on the Gemini 1.5 Pro safety token billing bug: An earlier version of Gemini 1.5 Pro was observed billing for safety-related tokens that weren’t part of useful output. Verify your billing assumptions if migrating from older Gemini versions.
Google’s enterprise angle (Workspace, Cloud) is a genuine advantage for teams already in that ecosystem. The IDE and tooling integrations are catching up to what OpenAI offers.
Section 4: Chinese Open Weights (The Real Story)
The Western narrative consistently underestimates what has happened here. The Chinese labs – DeepSeek, Alibaba (Qwen), and others – have shipped open-weight models that run locally, perform near frontier, and are available under permissive licenses. This changes the self-hosting calculus entirely. You no longer have to compromise significantly on quality to get a model you can run on your own infrastructure.
DeepSeek
DeepSeek’s V3 was the opening shot: trained for a reported ~$5.5 million (versus the hundreds of millions required for GPT-4-scale models) using a Mixture of Experts architecture that made the efficiency gains possible. The MoE approach means only a subset of parameters are active per token, which enables a much larger parameter count than would otherwise be feasible at that training cost. The quality on coding and reasoning tasks was credibly competitive with GPT-4.
DeepSeek R1 followed as the reasoning model, released January 2025 under the MIT license. It scores 79.8% on AIME 2024 and 97.3% on MATH-500 – putting it in the same performance tier as OpenAI’s o1 series. It runs locally, and the MIT license means you can use it commercially with minimal restrictions.
The DeepSeek family has continued iterating: V3.2 is the current general-purpose release, with further versions in the pipeline. For teams evaluating self-hosted inference, DeepSeek V3 (general purpose) and R1 (reasoning) are the first models to evaluate seriously. The hardware requirements are significant for the full models, but quantised versions run on less.
Qwen 2.5 (Alibaba)
Alibaba’s Qwen 2.5 family has quietly become one of the most important open-weights releases of the past year.
Qwen 2.5 72B: Matches or surpasses Llama 3-405B on most benchmarks at a fifth of the parameter count. MMLU 85+, HumanEval 85+, MATH 80+. This is a remarkable result – a 72B model outperforming a 405B model on standard benchmarks. For teams with hardware that can run 70B models, Qwen 2.5 72B is the general-purpose leader.
Qwen 2.5-Coder 32B: The best open-weight coding model available right now. HumanEval performance competitive with Claude Sonnet 3.5. If you’re self-hosting a coding assistant and your hardware supports 32B (it’s more accessible than 72B), this is the call.
Qwen 2.5-Max: Cloud API offering scoring 90% on HumanEval at $2 per million tokens. Competitive for teams that want Qwen quality without the self-hosting overhead.
QwQ-32B: Qwen’s reasoning model, analogous to DeepSeek R1. Runs locally on accessible hardware. Worth evaluating alongside R1 for reasoning-heavy tasks.
Mistral (European, But Worth Noting Here)
Mistral is French, not Chinese, but fits the open-weights-first philosophy of this section. Mistral Large competes with the top commercial models on general tasks. Mistral Small is a cost-effective API option. Codestral is Mistral’s code-specific model – strong on code generation and completion, worth evaluating if you’re building a coding tool and want a model with a clear commercial license and European data residency.
Le Chat is Mistral’s consumer interface, less relevant for engineering teams but notable as a sign of Mistral’s product ambitions beyond API provision.
The theme across all of these: the gap between closed frontier models and open-weight models has closed substantially. The top open-weight models are not 2-year-old versions of GPT-4. They are genuinely current, frequently updated, and in some cases (coding, reasoning) competitive with models you’d pay $15 per million tokens for.
Section 5: Self-Hosted – The Engineering Decision
This is where the decision gets interesting. Self-hosting was once a clear trade-off: worse models, more operational overhead, but better privacy and economics. That calculus has shifted. The open-weight models are good enough that self-hosting is now a legitimate architectural choice for quality-sensitive workloads, not just a cost-cutting measure.
Why Self-Host?
Privacy. Your code, proprietary data, and customer data don’t leave your infrastructure. For teams building on customer data, handling regulated information, or working in air-gapped environments, this isn’t optional – it’s a hard requirement.
Cost at scale. Per-token billing adds up fast at production volumes. The break-even point against hardware amortisation depends heavily on your usage patterns, but at 50M+ tokens per month, the economics of self-hosting become compelling.
Latency. No network round-trip. For latency-sensitive applications, a model running on local hardware can be faster than a frontier API even accounting for the smaller parameter count.
Control. No API deprecations. No rate limits. No outages from a provider’s infrastructure. No model behaviour changes pushed silently. You run the specific model version you’ve validated.
Air-gap requirements. Some regulated environments can’t call external APIs at all. Open-weight models are the only answer.
Hardware: What You Actually Need
Consumer tier (M2/M3/M4 Mac, 32GB+ unified memory): Runs 7B-13B models comfortably at useful speeds. A 70B model quantised to Q4_K_M will run, but slowly – think 5-10 tokens per second, which is borderline for interactive use. An M4 Max (128GB) runs 70B well enough to be genuinely useful.
Prosumer (RTX 4090, 24GB VRAM, or RTX 5090, 32GB VRAM): 7B-13B at full speed. 70B quantised at Q4 is possible on the 5090’s 32GB but tight. Better to pair two cards if you’re serious about 70B on GPU.
Server tier (2x A100 80GB): 70B at full quality, fast. 405B models quantised are possible. This is the minimum for production team inference at useful throughput.
Apple Silicon – the standout story for 2026: The M-series unified memory architecture means CPU and GPU share the same memory pool. A Mac Studio M4 Ultra with 192GB unified memory runs a 70B model at full (unquantised) quality at roughly 30 tokens per second. That’s fast enough for team-shared inference. Cost: approximately $5,000. An engineering team of 5 burning 100M input and 100M output tokens per month at Claude Sonnet prices ($3 input / $15 output) spends $1,800 monthly on API costs alone. Break-even on hardware is fast, and the privacy benefit is immediate.
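That break-even arithmetic is easy to sanity-check. A quick sketch, assuming the 100M figure covers input and output each (the split that produces the $1,800 number) – your own input/output ratio is the variable that matters most:

```python
def monthly_api_cost(input_mtok: float, output_mtok: float,
                     in_price: float, out_price: float) -> float:
    """API spend per month; token counts in millions, prices per million tokens."""
    return input_mtok * in_price + output_mtok * out_price

# 100M input + 100M output at Sonnet prices ($3 / $15 per million tokens).
monthly = monthly_api_cost(100, 100, 3.0, 15.0)   # 1800.0

# Mac Studio at ~$5,000 amortises in under three months at that spend.
breakeven_months = 5000 / monthly
```

If your workload is input-heavy (say 90M in / 10M out, roughly $420/month), break-even stretches to about a year – still plausible, but a different decision.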
For Apple Silicon, llama.cpp with Metal support is significantly faster than vLLM, which is still primarily CUDA-optimised. Ollama (which wraps llama.cpp) is the easiest path on Mac.
Tooling
Ollama is the entry point. `ollama pull qwen2.5:72b` and you’re running. Built-in OpenAI-compatible API means your existing code talks to it with a URL change. The model library is comprehensive and updated within hours of major releases. For teams that want to try self-hosting without infrastructure overhead, start here.
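The “URL change” is literal: Ollama serves the same chat-completions JSON shape on localhost (default port 11434). A stdlib-only sketch of the request – the actual call is left commented since it needs a running Ollama instance:

```python
import json
import urllib.request

# Ollama's OpenAI-compatible endpoint on its default port.
OLLAMA_URL = "http://localhost:11434/v1/chat/completions"

# Identical payload shape to what an OpenAI client sends.
payload = {
    "model": "qwen2.5:72b",
    "messages": [{"role": "user", "content": "Summarise this diff in one line."}],
}

req = urllib.request.Request(
    OLLAMA_URL,
    data=json.dumps(payload).encode(),
    headers={
        "Content-Type": "application/json",
        "Authorization": "Bearer ollama",  # any non-empty key is accepted
    },
)
# urllib.request.urlopen(req) returns a standard chat-completions response;
# commented out here because it requires Ollama running locally.
```

Equivalently, an existing OpenAI SDK client works by setting `base_url` to `http://localhost:11434/v1` – no other code changes.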
LM Studio is Ollama with a GUI. Better for non-terminal users, includes a built-in model browser, and exposes the same OpenAI-compatible server. Worth knowing about if you have team members who won’t use a command line.
llama.cpp is the engine underneath everything. Using it directly gives maximum control over quantisation options, batch sizes, and Metal vs CUDA acceleration. If you’re doing something Ollama doesn’t expose, drop to llama.cpp.
vLLM is the production inference server. PagedAttention reduces memory fragmentation, enabling dramatically higher throughput on GPU hardware. If you’re serving a team or building a production API endpoint, vLLM’s batching and throughput characteristics are meaningfully better than Ollama. Note: still primarily CUDA-optimised; don’t run vLLM on Apple Silicon expecting the same gains.
Hugging Face TGI (Text Generation Inference) is an alternative production server with strong ecosystem integration – if you’re already pulling models from Hugging Face, TGI fits naturally. Comparable to vLLM in throughput characteristics.
Which Models to Self-Host
Coding: Qwen 2.5-Coder 32B is the clear answer. Fits comfortably on hardware that supports 32B (Q4_K_M is about 19GB). HumanEval performance competitive with hosted mid-tier models. If you have the hardware for 70B, Qwen 2.5 72B or Qwen 2.5-Coder 72B are worth evaluating.
General purpose: Llama 3.3 70B has the broadest tool support – nearly every inference framework optimises for Llama first. Qwen 2.5 72B has stronger benchmark performance, particularly on multilingual tasks. Both are solid choices; pick based on your ecosystem.
Reasoning: DeepSeek R1. MIT license, o1-tier performance on reasoning benchmarks, runs locally at quantised Q4_K_M. If you need a model that can work through hard problems and your privacy requirements prevent calling external APIs, this is the answer.
Small and fast: Qwen 2.5 7B (strong for its size), Llama 3.2 3B (for edge inference), Gemma 2 9B (Google-backed, solid all-round). These run on almost any hardware including 8GB Apple Silicon M-series.
Multimodal: LLaVA (image understanding), Qwen-VL (vision-language, Qwen-quality multimodal), Llama 3.2 Vision (the clearest path if you’re already on Llama).
Quantisation: What the Letters Mean
Models are distributed at various quantisation levels that trade size (and hardware requirements) for quality. The format you’ll encounter most often via Ollama and GGUF files:
- Q8_0: Near-lossless. About half the size of a full-precision model. Use this if you have the VRAM and want maximum quality.
- Q4_K_M: The sweet spot for most use cases. Approximately 70% size reduction from full precision, less than 5% quality degradation on standard benchmarks. This is what you should run by default.
- Q2_K: Aggressive compression. Noticeable quality degradation on complex reasoning tasks. Only use when you’re hardware-constrained and need to run a model that otherwise won’t fit.
A 70B model at Q4_K_M requires about 40GB of memory. A 32B model at Q4_K_M needs about 19GB. Plan accordingly.
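A rough way to estimate those footprints yourself, assuming Q4_K_M averages around 4.5-4.8 effective bits per weight (the exact figure varies by architecture, and KV cache for long contexts comes on top):

```python
def gguf_size_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate in-memory size of a quantised model in GB.
    Excludes KV cache, which grows with context length."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

# Q4_K_M averages roughly 4.5-4.8 bits per weight depending on the model.
size_70b = gguf_size_gb(70, 4.6)   # ~40 GB
size_32b = gguf_size_gb(32, 4.6)   # ~18-19 GB
```

The same function explains why Q8_0 (~8.5 bits per weight) lands near half of full 16-bit precision, and why Q2_K fits where nothing else does.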
Section 6: How to Choose
A decision framework. Pick the right category, then pick the model.
Coding agent (Claude Code, Cursor, Copilot, etc.): Claude Sonnet 4.6 with extended thinking for complex refactors and multi-file work. Claude Haiku 4.5 for fast completions where latency matters. If self-hosted: Qwen 2.5-Coder 32B – it is genuinely competitive with hosted mid-tier on most coding tasks.
Production API, cost-sensitive: Gemini 2.0 Flash or Claude Haiku 4.5 for high volume. Flash wins on latency and rate limits; Haiku wins on instruction-following consistency. Evaluate both against your specific workload.
Production API, quality-sensitive: Claude Sonnet 4.6 or GPT-5.4. Sonnet at $3/$15 per million tokens is the value leader on quality tasks. GPT-5.4 for teams needing enterprise SLAs or native computer use.
Long-context (legal docs, codebases, large reports): Gemini 3.1 Pro (1M+ context, strong multimodal). Claude Sonnet 4.6 (200K standard, 1M premium tier). Google’s context handling at extreme lengths has improved significantly; test it against your actual documents, not just marketing claims.
Reasoning and hard problems: GPT-5.4 Thinking, Claude Opus 4.6, DeepSeek R1 if self-hosted. The reasoning models have measurably better performance on math, formal verification, and complex multi-step tasks. Don’t use them by default for average-case queries – the cost and latency don’t justify it.
Self-hosted, privacy-first: Qwen 2.5 72B via Ollama on Apple Silicon, or Llama 3.3 70B if you prioritise ecosystem breadth. Both perform well at Q4_K_M. Mac Studio M4 Ultra is the hardware recommendation for teams without a GPU server.
Self-hosted, coding focus: Qwen 2.5-Coder 32B. Full stop.
Air-gapped or regulated environments: DeepSeek R1 (MIT license, no telemetry) or Llama 3.3 70B (Meta’s license, widely audited). Both are fully open weights, no phoning home, no dependency on external services.
Agent architectures: Tool-use reliability matters more than raw benchmark scores. Claude models (Sonnet 4.6 and above) have the strongest track record here. Test tool-calling consistency on your specific tool schema before committing to a provider – the failure modes are subtle and workload-specific.
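One concrete way to run that test: fire the same prompt at your real tool schema N times and count how often the model emits exactly the call you expect. A sketch with a stubbed model call – `call_model` is a placeholder to swap for your provider SDK, and the ticket tool is a hypothetical example:

```python
def call_model(prompt: str) -> dict:
    """Stub standing in for a provider SDK call that returns the parsed
    tool invocation (name + arguments). Replace with a real API call."""
    return {"tool": "lookup_ticket", "arguments": {"ticket_id": "T-1234"}}

def tool_call_consistency(prompt: str, expected: dict, trials: int = 20) -> float:
    """Fraction of trials producing exactly the expected tool call."""
    hits = sum(call_model(prompt) == expected for _ in range(trials))
    return hits / trials

expected = {"tool": "lookup_ticket", "arguments": {"ticket_id": "T-1234"}}
rate = tool_call_consistency("Look up ticket T-1234", expected, trials=20)
# The deterministic stub scores 1.0; real models will vary run to run,
# and that variance is exactly what you're measuring.
```

Exact-match is deliberately strict: near-miss argument formats (wrong key casing, stringified numbers) are the subtle failure modes that break downstream tool execution.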
Closing
The model landscape in early 2026 is the best it has ever been for engineering teams. The Chinese open-weights releases have broken the assumption that serious AI capability requires a cloud API contract. You can now run a model on a Mac Studio that competes with what required a frontier API subscription a year ago. That’s a genuine shift, not incremental progress.
The frontier is still moving. GPT-5.4 shipped two days ago. Gemini 3.1 Pro shipped two weeks ago. Anthropic updated Sonnet and Opus to 4.6 last month. This post will need updating. That’s the point – this is a Signal, and the signal here is that the landscape is moving fast enough that any static comparison ages out within weeks.
Check the changelog. The recommendations above are correct as of March 7, 2026.