What’s New
| Date | Update |
|---|---|
| 22 Mar 2026 | Qwen 3 replaces Qwen 2.5 as the self-hosting recommendation (72B at MMLU 83.1 with built-in thinking mode); Llama 4 Scout and Maverick added with a real-world coding performance caveat; Mac Studio cluster ($40K vs $780K H100) changes team-scale economics. |
Section 1: Anthropic (Claude)
Anthropic runs the tightest product line of the major labs – three tiers, clearly differentiated, with a consistent quality floor that other providers still struggle to match on coding and instruction-following.
Claude Haiku 4.5
The cost-performance sweet spot for high-volume tasks. At $1 per million input tokens and $5 per million output tokens, it competes directly with GPT-4o mini and Gemini 2.0 Flash – and holds its own. Where Haiku earns its place is consistency: it rarely produces the kind of confidently wrong output that haunts cheaper alternatives. For classification, summarisation, fast code completions, and routing in multi-agent pipelines, this is the model you run at scale.
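As a concrete sketch of the routing pattern, the helper below sends routine, high-volume task types to the cheap tier and escalates everything else. The model identifier strings are assumptions for illustration, not confirmed API names – check your provider's model list.

```python
# Cost-tiered routing sketch. Model ID strings are assumptions --
# verify the exact identifiers against the provider's model list.
CHEAP_MODEL = "claude-haiku-4-5"
STRONG_MODEL = "claude-sonnet-4-6"

ROUTINE_TASKS = {"classification", "summarisation", "completion", "routing"}

def pick_model(task_kind: str, needs_deep_reasoning: bool = False) -> str:
    """Send high-volume, low-stakes work to the cheap tier; escalate the rest."""
    if task_kind in ROUTINE_TASKS and not needs_deep_reasoning:
        return CHEAP_MODEL
    return STRONG_MODEL
```

In a multi-agent pipeline this sits in front of the dispatch step, so the expensive model only sees the requests that actually need it.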
Claude Sonnet 4.5 / 4.6
The workhorse. $3 per million input tokens, $15 per million output tokens. SWE-bench Verified: 79.6% on Sonnet 4.6 – the highest score from any model on that benchmark at the time of its release. This is the number that matters for engineering teams: SWE-bench tests real repository-scale coding tasks, not synthetic benchmarks.
The jump from Claude 3.7 Sonnet (62.3% SWE-bench) to Sonnet 4.6 was significant. Extended thinking, introduced with Claude 3.7, persists into the 4.x generation: the model can visibly reason through complex problems before responding. Turning extended thinking on materially helps with multi-file refactors, debugging unfamiliar codebases, and architectural reasoning. It costs more and takes longer – use it selectively, not by default.
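Extended thinking is opt-in per request. A minimal payload builder, following the `thinking` parameter shape Anthropic documents for the Messages API – the model ID is an assumption, and field names should be verified against current docs:

```python
def build_message_request(prompt: str, extended_thinking: bool = False) -> dict:
    """Messages API payload sketch. The thinking block follows Anthropic's
    documented extended-thinking parameter; the model ID is an assumption."""
    req = {
        "model": "claude-sonnet-4-6",
        "max_tokens": 4096,
        "messages": [{"role": "user", "content": prompt}],
    }
    if extended_thinking:
        # Visible reasoning with an explicit token budget: better on
        # multi-file refactors, but slower and billed as output tokens.
        req["thinking"] = {"type": "enabled", "budget_tokens": 2048}
    return req
```

Keeping the toggle at the call site makes "use it selectively" enforceable: default off, on only for the requests that warrant it.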
Context window: 200K standard, 1M for specific use cases (requests over 200K input tokens carry a premium long-context rate). For most engineering workloads, 200K is sufficient.
Sonnet is the default recommendation for coding assistance, agent tasks, and long-context analysis. If you’re running Claude Code or Cursor backed by an API, this is what you want.
Claude Opus 4.5 / 4.6
The top of the range. $5 per million input tokens, $25 per million output tokens – a significant drop from the original Opus 4 ($15/$75), which makes it far more practical to deploy.
SWE-bench: 72.5% on Opus 4.6 in standard configuration, with claims of 80.8% in extended modes. On knowledge-work tasks scored by Elo rating, Opus holds a substantial advantage over Sonnet where judgment, complex reasoning across long documents, or vague and underspecified requirements are involved.
When is the cost premium worth it? Genuinely hard problems: multi-step code reviews across large codebases, security audit tasks, architectural decisions where the model needs to hold a lot of context and reason carefully. For routine coding, Sonnet is the better call – Opus’s advantage is in the tail of hard problems, not average-case performance.
Anthropic’s tooling story is strong: function calling and structured output are reliable, which matters for agent architectures where a wrong tool call cascades. This is a differentiator worth pricing in.
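Reliable tool calling starts with a tight schema. A minimal helper in the `input_schema` shape Anthropic's tool definitions use (field names per their published docs; treat this as a sketch, and the example tool is hypothetical):

```python
def make_tool(name: str, description: str,
              properties: dict, required: list) -> dict:
    """Tool definition in the input_schema shape used by Anthropic's API.
    Tight descriptions and explicit required fields reduce bad tool calls."""
    return {
        "name": name,
        "description": description,
        "input_schema": {
            "type": "object",
            "properties": properties,
            "required": required,
        },
    }

# Hypothetical example tool for illustration.
lookup = make_tool(
    "lookup_ticket",
    "Fetch a support ticket by its numeric ID.",
    {"ticket_id": {"type": "integer", "description": "Ticket number"}},
    ["ticket_id"],
)
```

The discipline matters more than the helper: vague descriptions and optional-everything schemas are where cascading wrong tool calls start.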
Section 2: OpenAI
OpenAI has had a complicated few months. The product line expanded faster than most teams could track, several models have already been retired from ChatGPT (GPT-4o, GPT-4.1, o4-mini), and the pricing trajectory has been inconsistent. The just-released GPT-5.4 looks like an attempt to consolidate – one unified frontier model instead of a sprawling o-series vs GPT-series split.
GPT-4o and the Baseline
GPT-4o remains the model most engineering teams have the most experience with. It is now being retired from ChatGPT but remains available via API. Its main appeal was the combination of multimodal capability (vision, voice) with competitive coding performance. It is still a good choice for teams with existing integrations they don’t want to rebuild. For new builds in 2026, Sonnet 4.6 at comparable cost is the cleaner choice.
The o-Series: Reasoning Models
The o-series (o1, o3, o3-mini, o4-mini) was OpenAI’s answer to the observation that scaling compute at inference time – rather than just training time – produces measurable gains on hard problems. The results are real: o3-mini earns top clarity ratings on technical tasks more consistently than GPT-4o, and o4-mini topped LongMemEval benchmarks on reasoning accuracy.
When to use an o-series model: math-heavy problems, formal verification tasks, complex multi-step reasoning where you need the model to work through a problem carefully before responding. When not to: fast iteration, high-volume tasks, anything latency-sensitive. Reasoning effort is configurable (none, low, medium, high, xhigh) but more effort means more latency and cost. Don’t default to xhigh and expect it to be cheap.
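The effort dial maps naturally onto a validated request builder. The `reasoning_effort` parameter name follows OpenAI's published API shape, but the model ID and the exact level names accepted by a given model are assumptions to verify:

```python
# Effort levels as described in this post; confirm what your model accepts.
EFFORT_LEVELS = ("none", "low", "medium", "high", "xhigh")

def reasoning_request(prompt: str, effort: str = "medium") -> dict:
    """Chat payload with configurable reasoning effort. Higher effort means
    more latency and cost -- default to medium, escalate deliberately."""
    if effort not in EFFORT_LEVELS:
        raise ValueError(f"unknown effort level: {effort!r}")
    return {
        "model": "o3-mini",  # assumed identifier
        "reasoning_effort": effort,
        "messages": [{"role": "user", "content": prompt}],
    }
```

Validating the level at the call site keeps a typo from silently falling back to a provider default you didn't choose.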
GPT-5.4
Released March 5, 2026. OpenAI calls it its “most capable and efficient frontier model for professional work.” The headline features:
- Native computer use (no plugin required – the model can operate your desktop and work across applications)
- 1.05M token context window
- Reasoning effort configurable in-model (folds in the o-series capability without a separate model)
- GPT-5.4 Pro and GPT-5.4 Thinking variants
The unified model approach is the right architectural call. Having a single endpoint where you dial reasoning effort rather than picking between GPT-4o and o3-mini reduces integration complexity. First-week impressions from engineering teams are positive on quality but cautious on pricing at high context lengths.
OpenAI’s API reliability track record is the best in the industry for enterprise teams that need SLA guarantees. If your organisation is already deep in the OpenAI ecosystem and needs enterprise support contracts, GPT-5.4 is the defensible choice.
Section 3: Google (Gemini)
Google’s story in early 2026 is better than the narrative gives it credit for. Gemini has been consistently underrated by engineering teams who tried an early version and wrote it off.
Gemini 2.0 Flash
The most underrated model in the current landscape for cost-sensitive production use. Sub-second response times (0.21-0.37s), 15x lower cost than Gemini Pro, and higher rate limits. For high-volume API use where latency matters – routing, classification, generation at scale – Flash is worth serious evaluation against Claude Haiku and GPT-4o mini. Flash 2.5 Lite is reportedly 1.5x faster than the 2.0 generation at lower cost.
Gemini 2.5 Pro
Competitive with mid-tier Claude and GPT. The quality gap between Flash and Pro is smaller than the marketing suggests (benchmark averages were 64.3% for Flash vs 66.9% for Pro in side-by-side evaluations). The cases where Pro justifiably wins: complex reasoning over long documents, multi-turn tasks with rich context, and Google Workspace/Cloud integration where native connectivity matters.
Gemini 3.1 Pro
Google’s current frontier model (as of late February 2026). Benchmarks put it in competition with Claude Opus 4.6 and Sonnet 4.6, with GPT-5.2 as the OpenAI reference point at the time of its release. Google’s multimodal story is strongest at this tier – native video, audio, and image understanding at frontier quality. The 1M+ context window is a genuine differentiator for teams doing document-heavy work: contract analysis, codebase understanding, or report synthesis across very large inputs.
Note on the Gemini 1.5 Pro safety token billing bug: An earlier version of Gemini 1.5 Pro was observed billing for safety-related tokens that weren’t part of useful output. Verify your billing assumptions if migrating from older Gemini versions.
Google’s enterprise angle (Workspace, Cloud) is a genuine advantage for teams already in that ecosystem. The IDE and tooling integrations are catching up to what OpenAI offers.
Section 4: Chinese Open Weights (The Real Story)
The Western narrative consistently underestimates what has happened here. The Chinese labs – DeepSeek, Alibaba (Qwen), and others – have shipped open-weight models that run locally, perform near frontier, and are available under permissive licenses. This changes the self-hosting calculus entirely. You no longer have to compromise significantly on quality to get a model you can run on your own infrastructure.
DeepSeek
DeepSeek’s V3 was the opening shot: trained for a reported ~$5.5 million (versus the hundreds of millions required for GPT-4-scale models) using a Mixture of Experts architecture that made the efficiency gains possible. The MoE approach means only a subset of parameters are active per token, which enables a much larger parameter count than would otherwise be feasible at that training cost. The quality on coding and reasoning tasks was credibly competitive with GPT-4.
DeepSeek R1 followed as the reasoning model, released January 2025 under the MIT license. It scores 79.8% on AIME 2024 and 97.3% on MATH-500 – putting it in the same performance tier as OpenAI’s o1 series. It runs locally. MIT license means you can use it without restriction. A May 2025 update (R1 0528) improved performance further while retaining the MIT license and the same hardware requirements.
The DeepSeek family has continued iterating: V3.2 is the current general-purpose release, with further versions in the pipeline. For teams evaluating self-hosted inference, DeepSeek V3 (general purpose) and R1 (reasoning) are the first models to evaluate seriously. The hardware requirements are significant for the full models, but quantised versions run on less.
Qwen 3 (Alibaba)
Alibaba’s Qwen 3 family, released in 2025, is the current self-hosting recommendation for general purpose inference. It replaces Qwen 2.5 across the board.
Qwen 3 72B: MMLU 83.1, HumanEval 84.2. The general-purpose self-hosting leader. For teams with hardware that can run 70B-class models, this is the first model to evaluate. Its predecessor (Qwen 2.5 72B) matched or surpassed Llama 3-405B with roughly a fifth of the parameters – Qwen 3 pushes that further.
Qwen 3 32B: MMLU Pro 65.54. The accessible option for coding and general tasks. Replaces Qwen 2.5-Coder 32B as the primary recommendation at this size class, though Qwen 2.5-Coder 32B remains a valid choice for coding-specific workloads.
Qwen 3 235B A22B: MMLU-Pro CS 87.4%, competitive with GPT-4o’s 88.0% on that benchmark. The top of the Qwen range for teams with the hardware to run it.
Thinking mode: Qwen 3 has built-in reasoning capability with a per-request toggle between standard and thinking modes. No separate model needed – the same model switches behaviour on demand. This is a material advantage over architectures where you have to pick between a fast model and a reasoning model at deployment time.
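Qwen 3's documentation describes a per-request soft switch: a `/think` or `/no_think` tag appended to the user turn toggles the mode without redeploying anything. A sketch of that convention – verify the exact switch syntax against the Qwen 3 docs for the version you deploy:

```python
def qwen_turn(user_text: str, think: bool) -> str:
    """Append Qwen 3's documented soft switch to toggle thinking mode
    per request. Switch syntax should be verified for your model version."""
    return f"{user_text} {'/think' if think else '/no_think'}"
```

One deployed model, two behaviours: routine traffic goes out with `/no_think`, and only the hard requests pay the thinking-mode latency.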
Edge/embedded range: The Qwen 3 family includes models from 0.6B to 8B parameters (0.6B, 1.7B, 4B, 8B) for constrained hardware. Qwen 3 at sub-4B is significantly stronger than equivalent Qwen 2.5 models at the same size.
Qwen 3 API (via Alibaba Cloud): For teams that want Qwen 3 quality without the self-hosting overhead, cloud API access is available at competitive pricing.
Llama 4 (Meta)
Meta released Llama 4 in April 2025. The architecture is a genuine departure – Mixture of Experts throughout, and native multimodal capability baked in at the model level, not bolted on.
Llama 4 Scout: 109B total parameters, 17B active (MoE). Native text and vision understanding. MMLU Pro 80.5, GPQA Diamond 69.8. Q4 quantisation needs approximately 55GB VRAM. With Unsloth’s 1.78-bit quantisation, it fits in 24GB – bringing it into reach of a single RTX 4090 or 5090.
Llama 4 Maverick: 400B+ total parameters, 17B active (MoE). MMLU 85.5, GPQA Diamond 69.8. Beats GPT-4o on MMMU (multimodal) benchmarks – notable for an open-weight model. Q4 needs approximately 294GB, which is not practical for most local setups. Unsloth 1.78-bit quantisation brings it to roughly 96GB, which is reachable with a Mac Studio M4 Ultra (192GB) or a multi-GPU server.
The benchmark-vs-reality caveat: Rootly benchmarked Llama 4 against coding-centric models and found real-world coding performance falls short of the headline benchmark numbers. The gap between benchmark scores and production coding results is notable. This doesn’t mean Llama 4 is bad – it means evaluate it on your actual tasks before committing, especially for coding-heavy workloads.
Who should look at Llama 4? Teams that need native multimodal inference locally, where the alternative is either a cloud API or running a separate vision model alongside a text model. Scout is the realistic self-hosting option. Maverick is notable on paper and relevant if you have the infrastructure.
Mistral (European, But Worth Noting Here)
Mistral is French, not Chinese, but fits the open-weights-first philosophy of this section. Mistral Large competes with the top commercial models on general tasks. Mistral Small is a cost-effective API option. Codestral is Mistral’s code-specific model – strong on code generation and completion, worth evaluating if you’re building a coding tool and want a model with a clear commercial license and European data residency.
Le Chat is Mistral’s consumer interface, less relevant for engineering teams but notable as a sign of Mistral’s product ambitions beyond API provision.
The theme across all of these: the gap between closed frontier models and open-weight models has closed substantially. The top open-weight models are not 2-year-old versions of GPT-4. They are genuinely current, frequently updated, and in some cases (coding, reasoning) competitive with models you’d pay $15 per million tokens for.
Section 5: Self-Hosted – The Engineering Decision
This is where the decision gets interesting. Self-hosting was once a clear trade-off: worse models, more operational overhead, but better privacy and economics. That calculus has shifted. The open-weight models are good enough that self-hosting is now a legitimate architectural choice for quality-sensitive workloads, not just a cost-cutting measure.
Why Self-Host?
Privacy. Your code, proprietary data, and customer data don’t leave your infrastructure. For teams building on customer data, handling regulated information, or working in air-gapped environments, this isn’t optional – it’s a hard requirement.
Cost at scale. Per-token billing adds up fast at production volumes. The break-even point against hardware amortisation depends heavily on your usage patterns, but at 50M+ tokens per month, the economics of self-hosting become compelling.
Latency. No network round-trip. For latency-sensitive applications, a smaller model running on local hardware can deliver lower end-to-end latency than a frontier API.
Control. No API deprecations. No rate limits. No outages from a provider’s infrastructure. No model behaviour changes pushed silently. You run the specific model version you’ve validated.
Air-gap requirements. Some regulated environments can’t call external APIs at all. Open-weight models are the only answer.
Hardware: What You Actually Need
Consumer tier (M2/M3/M4 Mac, 32GB+ unified memory): Runs 7B-13B models comfortably at useful speeds. A 70B model quantised to Q4_K_M will run, but slowly – think 5-10 tokens per second, which is borderline for interactive use. An M4 Max (128GB) runs 70B well enough to be genuinely useful.
Prosumer (RTX 4090, 24GB VRAM, or RTX 5090, 32GB VRAM): 7B-13B at full speed. 70B quantised at Q4 is possible on the 5090’s 32GB but tight. Better to pair two cards if you’re serious about 70B on GPU.
Server tier (2x A100 80GB): 70B at full quality, fast. 405B models quantised are possible. This is the minimum for production team inference at useful throughput.
Apple Silicon – the standout story for 2026: The M-series unified memory architecture means CPU and GPU share the same memory pool. A Mac Studio M4 Ultra with 192GB unified memory runs a 70B model at full (unquantised) quality at roughly 30 tokens per second. That’s fast enough for team-shared inference. Cost: approximately $5,000. An engineering team of 5 burning 100M tokens per month at Claude Sonnet prices ($3 input / $15 output) spends $1,800+ monthly on API costs alone. Break-even on hardware is fast, and the privacy benefit is immediate.
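The arithmetic behind the break-even claim, using this post's Sonnet rates and an assumed input/output split (your split will differ – output tokens dominate the bill):

```python
SONNET_IN, SONNET_OUT = 3.0, 15.0  # $ per million tokens, rates from this post

def monthly_api_cost(millions_in: float, millions_out: float) -> float:
    """API spend per month for a given token volume, in dollars."""
    return millions_in * SONNET_IN + millions_out * SONNET_OUT

def breakeven_months(hardware_cost: float, monthly_cost: float) -> float:
    """Months until hardware spend equals cumulative API spend."""
    return hardware_cost / monthly_cost

cost = monthly_api_cost(50, 100)        # assumed split: $1,650/month
months = breakeven_months(5_000, cost)  # Mac Studio pays back in ~3 months
```

Run the same formula with your actual token logs before buying hardware; an input-heavy workload breaks even much more slowly than an output-heavy one.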
The cluster story (March 2026): Four Mac Studios linked over Thunderbolt 5 RDMA delivered 1.5TB of unified memory and ran Kimi K2 at 25 tokens per second. Total hardware cost: approximately $40,000. The equivalent Nvidia H100 cluster: approximately $780,000. This is team-scale inference at a price point that was previously impossible outside a well-funded ML team. The economics of multi-model, multi-user self-hosting have changed. See also: local AI hardware comparison for a full decision guide.
For Apple Silicon, llama.cpp with Metal support is significantly faster than vLLM, which is still primarily CUDA-optimised. Ollama (which wraps llama.cpp) is the easiest path on Mac.
Tooling
Ollama is the entry point. `ollama pull qwen3:72b` and you’re running. Built-in OpenAI-compatible API means your existing code talks to it with a URL change. The model library is comprehensive and updated within hours of major releases. For teams that want to try self-hosting without infrastructure overhead, start here.
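The "URL change" is literal: point any OpenAI-compatible client at the local server. A sketch assuming Ollama's default port (11434) and its documented `/v1` compatibility path:

```python
def local_client_config(host: str = "localhost", port: int = 11434) -> dict:
    """Config for an OpenAI-compatible client aimed at a local Ollama server.
    Assumes Ollama's default port and its /v1 compatibility endpoint."""
    return {
        "base_url": f"http://{host}:{port}/v1",
        "api_key": "ollama",  # Ollama ignores the key, but clients require one
    }

cfg = local_client_config()
# from openai import OpenAI; client = OpenAI(**cfg)
# client.chat.completions.create(model="qwen3:72b", messages=[...])
```

Everything downstream – retries, streaming handlers, agent frameworks built on the OpenAI SDK – keeps working against the local endpoint.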
LM Studio is Ollama with a GUI. Better for non-terminal users, includes a built-in model browser, and exposes the same OpenAI-compatible server. Worth knowing about if you have team members who won’t use a command line.
llama.cpp is the engine underneath everything. Using it directly gives maximum control over quantisation options, batch sizes, and Metal vs CUDA acceleration. If you’re doing something Ollama doesn’t expose, drop to llama.cpp.
vLLM is the production inference server. PagedAttention reduces memory fragmentation, enabling dramatically higher throughput on GPU hardware. If you’re serving a team or building a production API endpoint, vLLM’s batching and throughput characteristics are meaningfully better than Ollama. Note: still primarily CUDA-optimised; don’t run vLLM on Apple Silicon expecting the same gains.
Hugging Face TGI (Text Generation Inference) is an alternative production server with strong ecosystem integration – if you’re already pulling models from Hugging Face, TGI fits naturally. Comparable to vLLM in throughput characteristics.
Which Models to Self-Host
Coding: Qwen 3 32B is the current recommendation. Fits comfortably on hardware that supports 32B (Q4_K_M is about 19GB). HumanEval performance competitive with hosted mid-tier models. If you have the hardware for 70B, Qwen 3 72B is worth evaluating. Qwen 2.5-Coder 32B remains a valid alternative for pure coding workloads.
General purpose: Qwen 3 72B has the strongest benchmark performance and the advantage of built-in thinking mode. Llama 3.3 70B has the broadest tool support – nearly every inference framework optimises for Llama first. Both are solid choices; pick based on whether benchmark ceiling or ecosystem breadth matters more to you.
Reasoning: DeepSeek R1 (or the updated R1 0528). MIT license, o1-tier performance on reasoning benchmarks, runs locally at quantised Q4_K_M. If you need a model that can work through hard problems and your privacy requirements prevent calling external APIs, this is the answer.
Small and fast: Qwen 3 8B (strong for its size, with thinking mode), Qwen 3 4B (edge inference), Llama 3.2 3B (for very constrained environments). These run on almost any hardware including 8GB Apple Silicon M-series.
Multimodal: Llama 4 Scout if you need native multimodal locally and have hardware in the 24-55GB VRAM range. LLaVA and Qwen-VL remain options. Llama 3.2 Vision if you’re already on the Llama ecosystem and hardware is limited.
Quantisation: What the Letters Mean
Models are distributed at various quantisation levels that trade size (and hardware requirements) for quality. The format you’ll encounter most often via Ollama and GGUF files:
- Q8_0: Near-lossless. About half the size of a full-precision model. Use this if you have the VRAM and want maximum quality.
- Q4_K_M: The sweet spot for most use cases. Approximately 70% size reduction from full precision, less than 5% quality degradation on standard benchmarks. This is what you should run by default.
- Q2_K: Aggressive compression. Noticeable quality degradation on complex reasoning tasks. Only use when you’re hardware-constrained and need to run a model that otherwise won’t fit.
A 70B model at Q4_K_M requires about 40GB of memory. A 32B model at Q4_K_M needs about 19GB. Plan accordingly.
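Those figures follow from a back-of-envelope formula: parameter count times effective bits per weight, divided by eight. The bits-per-weight values below are approximations, and KV cache plus runtime overhead come on top of the weight size:

```python
def gguf_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB. KV cache and runtime overhead
    are extra; effective bits-per-weight values are approximations."""
    return params_billions * bits_per_weight / 8

Q8_0, Q4_K_M, Q2_K = 8.5, 4.85, 2.6  # approximate effective bits per weight

size_32b = gguf_weight_gb(32, Q4_K_M)  # ~19.4 GB, matching the figure above
size_70b = gguf_weight_gb(70, Q4_K_M)  # ~42 GB, in line with "about 40GB"
```

Handy when sizing hardware for a model that isn't listed anywhere: plug in the parameter count and quantisation level before you buy.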
Section 6: How to Choose
A decision framework. Pick the right category, then pick the model.
Coding agent (Claude Code, Cursor, Copilot, etc.): Claude Sonnet 4.6 with extended thinking for complex refactors and multi-file work. Claude Haiku 4.5 for fast completions where latency matters. If self-hosted: Qwen 3 32B – it is genuinely competitive with hosted mid-tier on most coding tasks. Qwen 3 72B if you have the hardware.
Production API, cost-sensitive: Gemini 2.0 Flash or Claude Haiku 4.5 for high volume. Flash wins on latency and rate limits; Haiku wins on instruction-following consistency. Evaluate both against your specific workload.
Production API, quality-sensitive: Claude Sonnet 4.6 or GPT-5.4. Sonnet at $3/$15 per million tokens is the value leader on quality tasks. GPT-5.4 for teams needing enterprise SLAs or native computer use.
Long-context (legal docs, codebases, large reports): Gemini 3.1 Pro (1M+ context, strong multimodal). Claude Sonnet 4.6 (200K standard, 1M premium tier). Google’s context handling at extreme lengths has improved significantly; test it against your actual documents, not just marketing claims.
Reasoning and hard problems: GPT-5.4 Thinking, Claude Opus 4.6, DeepSeek R1 (or R1 0528) if self-hosted. The reasoning models have measurably better performance on math, formal verification, and complex multi-step tasks. Don’t use them by default for average-case queries – the cost and latency don’t justify it.
Self-hosted, general purpose: Qwen 3 72B via Ollama on Apple Silicon, or Llama 3.3 70B if you prioritise ecosystem breadth. Qwen 3’s built-in thinking mode is a practical advantage – you get reasoning capability without maintaining a separate model. Mac Studio M4 Ultra is the hardware recommendation for teams without a GPU server.
Self-hosted, multimodal: Llama 4 Scout. 17B active parameters (MoE), native vision, fits in 24GB VRAM with aggressive quantisation. Evaluate on your actual vision tasks before committing – the benchmark-vs-reality gap is notable for coding, but multimodal benchmarks are more favourable. See hardware decision guide for setup details.
Self-hosted, coding focus: Qwen 3 32B. Full stop.
Air-gapped or regulated environments: DeepSeek R1 (MIT license, no telemetry) or Llama 3.3 70B (Meta’s license, widely audited). Both are fully open weights, no phoning home, no dependency on external services.
Agent architectures: Tool-use reliability matters more than raw benchmark scores. Claude models (Sonnet 4.6 and above) have the strongest track record here. Test tool-calling consistency on your specific tool schema before committing to a provider – the failure modes are subtle and workload-specific.
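A simple way to run that test: replay the same prompt N times and score how often the model names the expected tool with parseable arguments. A scoring sketch – the result shape mirrors the common provider convention of a tool name plus a JSON-string arguments field, but adapt it to your client's actual response objects:

```python
import json

def consistency_rate(tool_calls: list, expected_tool: str) -> float:
    """Fraction of calls that name the expected tool with valid JSON
    arguments. Each element: {"name": str, "arguments": json_string}."""
    if not tool_calls:
        return 0.0
    ok = 0
    for call in tool_calls:
        try:
            json.loads(call.get("arguments", ""))
        except json.JSONDecodeError:
            continue  # malformed arguments count as a failure
        ok += call.get("name") == expected_tool
    return ok / len(tool_calls)
```

Run it per provider, per tool schema: a model that scores 0.99 on your schema is worth more than one that benchmarks higher but drops to 0.9 in production.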
For a detailed benchmark comparison across open and frontier models, see open vs frontier accuracy. For the broader architectural argument for local inference, see the local inference moment.
Closing
The model landscape in early 2026 is the best it has ever been for engineering teams. The Chinese open-weights releases have broken the assumption that serious AI capability requires a cloud API contract. You can now run a model on a Mac Studio – or a cluster of them – that competes with what required a frontier API subscription a year ago. Qwen 3 72B with built-in thinking mode, running locally on $40,000 of Apple Silicon cluster hardware, is not a compromise. The $780,000 H100 alternative is not ten times better.
The frontier is still moving. This post will need updating. That’s the point – this is a Signal, and the signal here is that the landscape is moving fast enough that any static comparison ages out within weeks.
Check the changelog. The recommendations above are correct as of March 22, 2026.