Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New This Week
Initial publication. No prior version.
Changelog
| Date | Summary |
|---|---|
| 22 Mar 2026 | First published. |
Agentic AI has a cost problem that’s hard to see until you’re already paying for it.
A single-turn chat completion costs roughly what you’d expect. An agentic workflow is different. Tools get called, outputs get appended, context accumulates across dozens of steps. By the time your agent has done anything interesting, it’s already burned through 50,000 tokens of context – and that’s before it starts writing output. Multiply that across a fleet of agents running continuously, and you’re not paying chat prices anymore.
At Claude Opus 4.6 pricing – $25 per million output tokens, up to $150 with priority access – routing every agentic action through a frontier API is becoming economically unviable at scale. The maths stops working somewhere between “prototype” and “production.”
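To make that concrete, here's a back-of-envelope cost model for a single agent run where the context is re-sent and grows at every step. Every number in it is an illustrative assumption (step count, tokens per step, the input price); only the $25/M output figure comes from the pricing above.

```python
# Back-of-envelope cost model for one agentic run where context accumulates
# and is re-sent at every step. All figures are illustrative assumptions
# except the $25/M output price quoted above; the input price is hypothetical.

STEPS = 30                        # tool calls / reasoning turns in one run
TOKENS_ADDED_PER_STEP = 2_000     # tool output + reasoning appended each step
OUTPUT_TOKENS_PER_STEP = 500
INPUT_PRICE_PER_M = 15.0          # $/M input tokens (assumed)
OUTPUT_PRICE_PER_M = 25.0         # $/M output tokens (from the pricing above)

context = 3_000                   # starting system prompt + task description
input_tokens = 0
for _ in range(STEPS):
    input_tokens += context               # the whole context is re-sent every step
    context += TOKENS_ADDED_PER_STEP      # and it keeps growing

output_tokens = STEPS * OUTPUT_TOKENS_PER_STEP
cost = input_tokens / 1e6 * INPUT_PRICE_PER_M + output_tokens / 1e6 * OUTPUT_PRICE_PER_M

print(f"input tokens per run: {input_tokens:,}")      # ~960,000 with these assumptions
print(f"cost per run: ${cost:.2f}")                   # ~$14.78
print(f"1,000 runs/day: ${cost * 1_000:,.0f}/day")    # ~$14,775
```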
The obvious response is routing. Use a local or open-weight model for the routine steps – sorting, parsing, classifying, deciding whether to call a tool – and only escalate to the frontier when you actually need it. The problem is that for the last 18 months, the open-weight tier hasn’t been good enough to trust with anything load-bearing.
That’s changing.
The Open-Source Void
Three of the most credible open-weight programmes are, simultaneously, in trouble.
Meta has slowed its Llama release cadence. DeepSeek R2 is facing delays, with training instability on Huawei Ascend chips reportedly the cause. The Qwen team at Alibaba has seen high-profile departures. None of these are fatal. All of them create a gap.
Nvidia stepped into it.
Nemotron 3 Super is a 120B parameter model with 12B active (Mixture of Experts architecture), pre-trained on 25T+ tokens, with a 1M token context window. On the Artificial Analysis Intelligence Index it scores 36 – ahead of OpenAI's GPT-OSS-120B, slightly behind Qwen3.5-122B. That's competitive enough to be taken seriously as a routing tier for agentic work. But the architecture is what makes it interesting.
What the Architecture Actually Does
Three things are worth understanding here, because they’re not just benchmark-padding – they directly address the problems that make long-context agentic deployments hard.
Hybrid Mamba-Transformer. Standard Transformer attention gets expensive at long range: compute scales quadratically with context length, and the KV cache grows with every token, so long contexts eat both time and memory. Mamba layers use a different mechanism – state space models – that carries a fixed-size state, so cost stays roughly flat regardless of sequence length. Nemotron 3 Super integrates Mamba layers alongside standard Transformer blocks. The result is a model that handles 1M token contexts without the memory wall that kills pure Transformer architectures at long range. It outperforms both GPT-OSS-120B and Qwen3.5-122B on the RULER long-context benchmark at 1M tokens. That's not incidental to the agentic use case – it's the whole point.
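An illustrative comparison of the two memory profiles – a growing Transformer KV cache versus a fixed-size state-space state. The layer counts and dimensions below are invented for the sake of the arithmetic; they are not Nemotron 3's actual configuration.

```python
# Illustrative memory comparison: Transformer KV cache vs. a Mamba-style SSM state.
# Layer counts and dimensions are invented for the arithmetic; they are NOT
# Nemotron 3's real configuration.

BYTES = 2  # fp16/bf16 bytes per element

def kv_cache_gb(seq_len, n_layers=48, n_kv_heads=8, head_dim=128):
    # K and V are cached for every token at every attention layer
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * BYTES / 1e9

def ssm_state_gb(n_layers=48, d_inner=8192, d_state=128):
    # A state-space layer carries a fixed-size state, independent of sequence length
    return n_layers * d_inner * d_state * BYTES / 1e9

for seq_len in (8_000, 128_000, 1_000_000):
    print(f"{seq_len:>9,} tokens: KV cache ≈ {kv_cache_gb(seq_len):6.1f} GB, "
          f"SSM state ≈ {ssm_state_gb():4.2f} GB")
```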
LatentMoE. Tokens are projected into a smaller latent dimension before expert routing and computation. This reduces the communication overhead that normally makes large MoE models slow in practice, and allows roughly 4x more experts within the same compute budget. More experts means more specialisation; lower communication overhead means you actually get to use them efficiently.
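A conceptual sketch of the idea (not NVIDIA's implementation): project hidden states down to a latent dimension, do routing and expert computation there, then project back up. The class name, dimensions, and routing details are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentMoESketch(nn.Module):
    """Conceptual MoE-with-latent-bottleneck sketch; dimensions are placeholders."""

    def __init__(self, d_model=1024, d_latent=256, n_experts=16, top_k=2):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)       # project into the latent space
        self.up = nn.Linear(d_latent, d_model)         # project back out afterwards
        self.router = nn.Linear(d_latent, n_experts)   # routing happens in latent space
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_latent, 4 * d_latent),
                          nn.GELU(),
                          nn.Linear(4 * d_latent, d_latent))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, d_model)
        z = self.down(x)                               # (tokens, d_latent)
        weights = F.softmax(self.router(z), dim=-1)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)
        out = torch.zeros_like(z)
        for slot in range(self.top_k):                 # dense loops for clarity, not speed
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += top_w[mask, slot, None] * expert(z[mask])
        return self.up(out)                            # back to d_model

print(LatentMoESketch()(torch.randn(8, 1024)).shape)  # torch.Size([8, 1024])
```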
Multi-Token Prediction (MTP). Rather than predicting one token at a time, the model predicts several future tokens simultaneously. Faster time-to-first-token, better coherence on reasoning tasks.
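Roughly, MTP means extra output heads, each predicting a different future offset from the same hidden state. A minimal sketch, with invented dimensions and no claim to match NVIDIA's formulation:

```python
import torch
import torch.nn as nn

class MTPHeadsSketch(nn.Module):
    """Minimal multi-token-prediction sketch: head k predicts token t+k+1 from
    the hidden state at position t. Dimensions are invented; Nemotron 3's actual
    MTP formulation may differ."""

    def __init__(self, d_model=512, vocab=32_000, n_future=4):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab) for _ in range(n_future)])

    def forward(self, hidden):                          # hidden: (batch, seq, d_model)
        return [head(hidden) for head in self.heads]    # one logit tensor per future offset

# At decode time the extra heads act as cheap draft predictions that can be
# verified in a single pass, which is where the speed-up comes from.
logits = MTPHeadsSketch()(torch.randn(1, 16, 512))
print([tuple(l.shape) for l in logits])                 # 4 × (1, 16, 32000)
```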
Together: a model that’s architecturally designed to stay fast at long context. Relevant if you’re routing agents through it continuously.
The Hardware Strategy
Nemotron 3 Super is trained natively in NVFP4 format from the ground up. This is not a footnote.
NVFP4 delivers peak throughput only on Blackwell GPUs – the B200 and RTX Pro 6000 Blackwell. On Hopper or older hardware, the model falls back to FP8. The throughput advantage is Blackwell-specific: NVIDIA’s technical report (March 11, 2026) claims up to 2.2x higher throughput than GPT-OSS-120B and 7.5x higher than Qwen3.5-122B on B200 GPUs. Artificial Analysis confirmed 11% higher throughput per B200 over GPT-OSS-120B when comparing NVFP4 to MXFP4.
This is not altruism. It’s a hardware sales strategy.
Nvidia releases the weights for free. The weights perform best on Nvidia's newest silicon. Meta releases Llama to drive brand awareness and attract talent. Mistral releases weights to drive API subscriptions. Nvidia releases weights to sell GPUs. Different business model, same tactic: open-source as a market-shaping instrument.
That doesn’t make the model less useful. It just means understanding what you’re optimising for when you choose to run it.
Can You Run It Locally?
Honest answer: it depends heavily on your hardware.
Ollama is supported. Nemotron 3 Super is available at ollama pull nemotron-3-super:120b-a12b. Unsloth's GGUF is on HuggingFace at unsloth/NVIDIA-Nemotron-3-Super-120B-A12B-GGUF. There's a community benchmark from the LocalLLaMA subreddit showing around 62 t/s at 512K context on an RTX Pro 6000 Blackwell – which is usable.
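Once pulled, the simplest way to exercise it is Ollama's local HTTP API. A minimal sketch – the endpoint and payload are Ollama's standard /api/generate interface; the model tag is the one above:

```python
import json
import urllib.request

# Minimal call against a locally running Ollama daemon (default port 11434).
# The model tag is the one pulled above; adjust if your local tag differs.
payload = {
    "model": "nemotron-3-super:120b-a12b",
    "prompt": "Summarise the trade-offs of routing agent steps to a local model.",
    "stream": False,
}
req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```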
There’s a known GGUF compatibility issue. Ollama’s MoE GGUF blobs are not compatible with upstream llama.cpp – the weight format for the MoE blocks is wrong. A workaround exists (build llama.cpp from source at a specific commit – see the NVIDIA Developer Forums thread), but it’s not fixed in mainline Ollama yet. This is being resolved, but check before depending on it.
On Apple Silicon: NVFP4 does not apply. Mac users run GGUF Q4 via Ollama without the Blackwell speed gains. Q4_K_M for a 120B MoE model needs roughly 65-70GB. That’s too large for 64GB unified memory. Nemotron 3 Nano (30B A3B or 4B) fits comfortably. If you specifically want Super on Apple Silicon, you need an M3 or M4 Ultra with 192GB.
On consumer Nvidia hardware: An RTX 4090 (24GB VRAM) cannot fit Super. For Super, you’re looking at 96GB+ VRAM – Tinybox-class hardware or professional cards. Nano 30B in quantised form runs fine on a 4090.
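The sizing figures above follow from back-of-envelope arithmetic: weights at roughly 4.5-5 bits per parameter for a Q4_K_M quant, plus KV cache and runtime overhead on top. A rough calculator, with the bits-per-parameter value as an assumption:

```python
# Rough quantised-weight sizing. The bits-per-parameter figure is a ballpark
# assumption for Q4_K_M, not an exact number for any specific GGUF.

def weight_gb(params_billion, bits_per_param=4.7):
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(f"Super 120B @ ~4.7 bpp: ~{weight_gb(120):.0f} GB of weights")   # ~70 GB
print(f"Nano   30B @ ~4.7 bpp: ~{weight_gb(30):.0f} GB of weights")    # ~18 GB
# KV cache and runtime overhead come on top, which is why 64GB of unified
# memory is too tight for Super while a 24GB RTX 4090 handles quantised Nano.
```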
The short version: Super on Blackwell hardware is where the performance numbers come from. Nano is the practical local option for most engineers right now.
The Routing Architecture This Enables
Nemotron 3 Nano comes in 30B A3B and 4B parameter versions. It’s designed for high-frequency routine tasks in agentic pipelines – sorting, parsing, routing decisions. It’s also on Ollama: ollama pull nemotron-3-nano.
The pairing is the point. Super handles complex multi-step reasoning, long-context analysis, anything requiring genuine intelligence. Nano handles the high-volume routine steps. Frontier APIs (Claude, GPT-4) handle whatever genuinely needs cutting-edge capability.
This is the architecture that makes agentic work economically viable at scale: local hardware for inference on routine tasks, open-weight models where accuracy is good enough, frontier APIs reserved for hard problems. The cost profile flattens dramatically once you stop routing everything through $25/M token APIs.
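A minimal sketch of that tiering, assuming the Ollama tags used above for the local models. The step-kind heuristic and the frontier_api stub are placeholders you'd replace with your own escalation criteria and provider:

```python
import json
import urllib.request

# Minimal three-tier router sketch. The step-kind heuristic and the frontier_api()
# stub are placeholders; the local model tags are the ones used in this article.

NANO = "nemotron-3-nano"
SUPER = "nemotron-3-super:120b-a12b"

def ollama_generate(model, prompt):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": model, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

def frontier_api(prompt):
    raise NotImplementedError("call your frontier provider of choice here")

def route(step_kind, prompt, context_tokens=0):
    # Placeholder heuristic: routine steps go to Nano, heavier reasoning and
    # long-context work to Super, and only genuinely hard problems to the paid tier.
    if step_kind in {"classify", "parse", "tool_select"}:
        return ollama_generate(NANO, prompt)
    if step_kind in {"reason", "long_context"} and context_tokens < 500_000:
        return ollama_generate(SUPER, prompt)
    return frontier_api(prompt)
```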
The local inference moment is being built bottom-up: better hardware, better models, better tooling for routing between tiers. Nemotron 3 fits that stack if you’re on Blackwell hardware. It’s competitive even if you’re not.
What Nvidia has done is make the open-weight reasoning tier a serious option for the first time. Whether that converts to B200 sales for them is their problem. For engineers building agent infrastructure, the relevant question is whether the model is good enough and cheap enough to deploy. On both counts, the answer is getting harder to dispute.
Sources: NVIDIA technical blog – NVIDIA NIM model card – Artificial Analysis – Technical report – Unsloth GGUF – Ollama library – NVIDIA Developer Forums