This post focuses on the architectural changes in Small 4 and what they mean for self-hosted deployments. For a broader overview of running your own inference stack, see Self-Hosting AI: What Actually Works in 2025.
What’s New This Week
Mistral released Small 4 this week – a 119B Mixture of Experts model under Apache 2.0, unifying reasoning, multimodal, and coding agent capabilities into a single deployment. The engineering headline: one model now does what previously required three.
Changelog
| Date | Summary |
|---|---|
| 17 Mar 2026 | Initial publication. |
Mistral Small 4 is a 119B Mixture of Experts model with 6B active parameters at inference, a 256K context window, configurable reasoning, native image understanding, and an Apache 2.0 licence. That’s the spec sheet.
The more interesting story is what it replaces. Until now, getting reasoning, multimodal input handling, and coding agent capability from Mistral meant running three separate models: Magistral for deep reasoning, Pixtral for images, Devstral for code agents. Small 4 consolidates all three into one. For teams self-hosting AI infrastructure, that distinction matters more than any individual benchmark number.
The Operational Cost of Specialised Models
Running multiple specialised models creates compounding overhead that’s easy to underestimate. You’re not just managing one deployment – you’re managing N endpoints, N routing rules, N prompting patterns, N evaluation harnesses, and N update cadences. Every time a model is updated, you need to re-evaluate it, re-tune your prompts, and regression-test the pipeline that depends on it.
The routing layer itself becomes a source of latency and failure. Simple classification (“is this a reasoning task or a coding task or a chat task?”) sounds trivial until you realise that many real requests sit at the boundary. A user asking your coding assistant to explain a diagram attached to a message needs at least two of the three capabilities. When those capabilities live in separate models, you either pick one and lose fidelity, or build a multi-step orchestration pipeline that calls multiple models sequentially.
Mistral’s thesis with Small 4 is that a unified model reduces this operational surface area. One deployment. One endpoint. One evaluation set. One update to track. Whether that trade-off is worth it depends on whether the unified model is actually capable across all three domains – and the benchmarks suggest it is. Small 4 beats GPT-OSS 120B on LiveCodeBench while producing 20% less output. On AA LCR it scores 0.72 with 1.6K characters, where Qwen models need 5.8-6.1K characters for comparable results. The efficiency advantage matters: shorter outputs mean lower latency and lower inference cost per request.
What reasoning_effort Actually Changes
The reasoning_effort parameter is the most practically significant feature in Small 4. Set it to none and you get fast, low-latency responses comparable to Small 3.2 chat mode. Set it to high and you get step-by-step chain-of-thought reasoning at the level of Magistral.
The operational implication: you can route requests by complexity within a single model deployment. Customer-facing chat queries with reasoning_effort=none get sub-second responses. Complex analytical requests with reasoning_effort=high get thorough chain-of-thought at the cost of higher latency.
Previously this required two separate model deployments – one fast instruct model and one reasoning model – with a routing layer sitting in front of them to decide which to call. That’s three things to operate instead of one. Small 4 collapses it to a parameter value.
The performance claims back this up: 40% reduction in end-to-end completion time in latency-optimised configuration, 3x more requests per second in throughput-optimised configuration, both measured against Small 3. The same hardware handles significantly more workload.
Apache 2.0: What It Actually Changes
Apache 2.0 means commercial use, modification, redistribution, and no copyleft obligation. You can fine-tune the weights, build a product on top of them, deploy it behind a paid API, and ship it to customers – without owing Mistral anything beyond attribution.
This matters because the open-weight model licensing landscape is messier than it appears. Meta’s Llama licence restricts commercial use above 700 million monthly active users, which is irrelevant for most teams but creates a compliance question that legal teams may not want to carry. Many other “open” models have use restrictions, non-commercial clauses, or acceptable-use policies that restrict certain industries or applications.
Apache 2.0 removes the compliance question. There’s nothing to interpret, no usage threshold to track, no categories of prohibited use to audit against. For enterprises building internal tooling or customer-facing products, that clarity is worth something.
It also means the model can be fine-tuned and redistributed without triggering copyleft. If you train a domain-specific variant for your use case, you own that without restriction.
Hardware Requirements and the MoE Cost Calculation
The minimum hardware requirement – 4x NVIDIA HGX H100, 2x HGX H200, or 1x DGX B200 – is enterprise infrastructure. This is not running on a gaming PC, and it’s not cheap to rent at scale. Anyone planning a self-hosted deployment should read the hardware context from GTC 2026 before sizing.
But the MoE architecture changes the cost calculation in a way that the total parameter count obscures. Small 4 has 119B total parameters across 128 experts. At inference, only 4 experts are active per token, which means the effective computation per token corresponds to 6B active parameters (8B including embedding and output layers). This is the same base architecture as Leanstral.
In practice, this means inference cost is determined by the 6B active parameters, not the 119B total. A dense 119B model would require far more compute per token. The MoE architecture gives you the capacity and specialisation of a very large model at the inference cost of a much smaller one. That changes what hardware is actually required for a given throughput target.
The 3x throughput improvement over Small 3 on the same hardware is the practical headline. If you’re already running Small 3, the same infrastructure handles three times the request volume with Small 4. For teams that have already made the infrastructure investment, the upgrade path is straightforward.
Ecosystem Support
Small 4 launched with day-one support on vLLM, llama.cpp, SGLang, Transformers, and HuggingFace. This matters for self-hosted deployments because it means no waiting for community-built integrations or unofficial forks.
vLLM and SGLang are the standard choices for high-throughput production inference. Both have been optimised in collaboration with NVIDIA for Small 4, which means you’re not working from generic MoE support – the inference paths are explicitly tuned.
The llama.cpp support is worth noting separately. llama.cpp can run quantised MoE models on more modest hardware than the official minimum, with quality trade-offs. For teams that want to evaluate Small 4 before committing to production hardware, or that need a development environment without full H100 access, quantised llama.cpp is a practical option. The quality degradation with aggressive quantisation on a MoE model is a real concern, but for prototyping and capability evaluation it’s adequate.
NVIDIA NIM containerised deployment is also available, which provides optimised inference out of the box without manual vLLM configuration. For teams already in the NVIDIA ecosystem, that’s the path of least resistance to production.
What Changed from Small 3
Small 3 was an instruct model – capable and efficient, but text-only and without a reasoning mode. Small 4 adds three things: configurable reasoning via reasoning_effort, native image input, and the coding agent capability previously in Devstral.
The operational improvements are the headline for teams already running Small 3. The 40% latency reduction means faster response times on the same hardware. The 3x throughput improvement means more requests per second before you need to scale. The same deployment now handles reasoning tasks and image inputs that previously required separate models.
The model landscape signal is that this kind of consolidation is the direction the field is moving. Specialised models were a stepping stone while capabilities matured in isolation. When a single model can match or exceed specialists across domains, the operational simplicity of consolidation becomes the obvious choice.
The Direction
Mistral is building toward a world where one capable open model handles most workloads. Small 4 is the clearest version of that thesis yet. One deployment, configurable reasoning effort, image understanding, coding agent capability, Apache 2.0, and day-one support on every major inference stack.
For teams evaluating whether to run their own inference, the question isn’t whether Small 4 is capable enough. The benchmarks settle that. The question is whether your organisation has or can justify the infrastructure – and for those that can, Small 4 meaningfully simplifies what that infrastructure needs to do.
The model that can replace three is operationally different from the model that’s best at one thing. That’s the bet Mistral has made, and with Small 4, it’s paying off.