Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New

Nvidia released Nemotron 3 Super on 11 March 2026, the same day Wired published the $26 billion figure from Nvidia’s 2025 SEC filings. Bryan Catanzaro (VP of applied deep learning research at Nvidia) confirmed in the Wired interview that Nvidia has also finished pretraining a separate 550B-parameter model. The open-weight ecosystem just changed shape.


Changelog

Date          Summary
12 Mar 2026   Initial publication.

A few months ago, Nvidia released Nemotron 3 Nano – a 31.6B-parameter model built for efficiency, intended as the coordination layer for agentic workflows. Yesterday they released Nemotron 3 Super, the reasoning flagship. And buried in a 2025 financial filing, Wired found something that wasn’t in any press release: $26 billion committed to training open-weight AI models over the next five years.

That number changes the conversation.

What Nemotron 3 Super actually is

Super is a 120B-total-parameter model with 12B parameters active per token. The architecture combines three things that don’t usually appear together: Mamba-2 state space layers, Transformer attention layers, and a Mixture-of-Experts (MoE) routing system. Nvidia calls this a hybrid Mamba-Transformer MoE backbone, and the combination is deliberate rather than experimental.

Mamba layers handle the bulk of sequence processing in linear time, which is what makes the 1M-token context window practical. Pure state space models can struggle with precise associative recall – finding one specific fact buried deep in a long context. The interleaved Transformer attention layers preserve that capability. The MoE layer then scales effective parameter count without scaling inference cost proportionally: only 12B parameters activate per token despite 120B being available.
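The interleaving described above can be sketched as a layer schedule: mostly Mamba-2 mixers, with periodic attention layers for precise recall. The 12-layer depth and one-attention-in-four ratio below are illustrative assumptions, not Nvidia's published configuration.

```python
def layer_schedule(n_layers: int, attn_every: int) -> list[str]:
    """Return the sequence-mixer type for each layer in a hybrid stack.

    Most layers use linear-time Mamba-2 mixing; every `attn_every`-th
    layer uses full attention to preserve associative recall. In the
    architecture described above, each mixer would be followed by a
    MoE feed-forward block (not shown here).
    """
    return [
        "attention" if (i + 1) % attn_every == 0 else "mamba2"
        for i in range(n_layers)
    ]

schedule = layer_schedule(n_layers=12, attn_every=4)
print(schedule.count("attention"), "attention layers of", len(schedule))
```

Because the Mamba layers dominate the count, total sequence-mixing cost stays close to linear in context length, which is what makes the 1M-token window workable.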

The 10:1 total-to-active ratio matters for throughput. Multi-agent systems generate up to 15x the tokens of standard chat sessions – resending history, tool outputs, and reasoning traces at every turn. A dense 120B model would be unusable in that context. The MoE architecture keeps latency low when many agents are running concurrently.
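The throughput arithmetic is easy to check with the common rule of thumb of roughly two FLOPs per active parameter per generated token (an approximation, not a measured figure):

```python
def inference_flops_per_token(active_params: float) -> float:
    # Common rule of thumb: ~2 FLOPs per active parameter per token.
    return 2.0 * active_params

dense_120b = inference_flops_per_token(120e9)  # hypothetical dense model
moe_super = inference_flops_per_token(12e9)    # 12B active of 120B total

print(dense_120b / moe_super)  # -> 10.0
```

At 15x the token volume of a chat session, that 10x per-token saving is the difference between a multi-agent system that is interactive and one that is not.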

Nvidia also introduced latent MoE, which compresses token embeddings into a low-rank space before routing decisions are made. This lets Super consult 4x as many experts for the same computational cost as a standard MoE at the same size – finer-grained specialization without the corresponding cost increase.
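A minimal sketch of the routing idea, with made-up dimensions: the router scores experts in a compressed latent space, so the per-expert scoring cost scales with the small latent width rather than the full model width. This is an illustration of the general technique, not Nvidia's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, n_experts, top_k = 512, 64, 32, 4

# Down-projection into the low-rank latent space, then a router that
# scores all experts against the compressed representation.
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_route = rng.standard_normal((d_latent, n_experts)) / np.sqrt(d_latent)

def route(x: np.ndarray) -> np.ndarray:
    """Pick the top-k experts for one token embedding."""
    z = x @ W_down           # compress: (d_model,) -> (d_latent,)
    scores = z @ W_route     # per-expert score costs d_latent, not d_model
    return np.argsort(scores)[-top_k:]

token = rng.standard_normal(d_model)
chosen = route(token)
print(sorted(chosen.tolist()))
```

With scoring done in 64 dimensions instead of 512, the router can afford to consider many more experts for a similar budget – the finer-grained specialization the latent-MoE design is after.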

Two other design choices are worth noting. Multi-token prediction (MTP) trains specialized heads to predict several future tokens simultaneously, which improves chain-of-thought reasoning during training and enables built-in speculative decoding at inference – up to 3x wall-clock speedups for structured generation like code and tool calls. And Super is pretrained natively in NVFP4 – Nvidia’s 4-bit floating-point format – rather than being quantized after training. The model learns accurate representations within the constraints of 4-bit arithmetic from the first gradient update rather than having precision degraded at the end.
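The speculative-decoding side of MTP reduces to a simple accept rule: the draft heads propose several tokens ahead, the main model verifies them in one pass, and everything up to the first disagreement is kept. A toy sketch of that rule (assumed mechanics of self-speculative decoding in general, not Nvidia's implementation):

```python
def accept_draft(draft: list[int], verified: list[int]) -> list[int]:
    """Keep the longest matching prefix of the draft tokens.

    `draft` comes from the cheap multi-token prediction heads;
    `verified` is what the full model would emit at each position.
    On the first mismatch, the verifier's token replaces the draft's
    and generation resumes from there.
    """
    accepted: list[int] = []
    for d, v in zip(draft, verified):
        if d == v:
            accepted.append(d)
        else:
            accepted.append(v)
            break
    return accepted

# Draft of 4 tokens; the verifier agrees with the first 2:
print(accept_draft([5, 9, 3, 7], [5, 9, 8, 7]))  # -> [5, 9, 8]
```

Structured output like code and tool calls is highly predictable token-to-token, so drafts are accepted often – which is where the claimed wall-clock speedups come from.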

The training pipeline is unusually transparent. Pretraining ran on 25 trillion tokens (10 trillion unique, deduplicated, quality-filtered). Supervised fine-tuning used approximately 7 million samples drawn from a 40-million-sample post-training corpus. Reinforcement learning ran across 21 environment configurations generating 1.2 million environment rollouts – trajectory-based RL rather than static preference optimization.

Weights are on Hugging Face. Nvidia has also released the full training and evaluation recipe, the pretraining datasets, the post-training datasets, and the RL environments. This is more open than just releasing weights – it’s reproducible enough to build on.

The intended deployment pattern is Super as the reasoning engine handling complex multi-step tasks, Nano as the coordination layer handling targeted individual steps. For software development: Nano handles straightforward merge requests, Super handles tasks that require deep codebase understanding, proprietary models handle anything at the frontier.
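That tiered dispatch can be sketched as a simple router. The difficulty heuristic and the task-dict shape below are illustrative assumptions, not part of any released Nemotron tooling:

```python
def pick_model(task: dict) -> str:
    """Route a software-development task to a model tier.

    Hypothetical heuristic: frontier-difficulty tasks go to a
    proprietary model, tasks needing deep cross-file codebase
    understanding go to Super, routine steps go to Nano.
    """
    if task.get("frontier", False):
        return "proprietary-frontier"
    if task.get("cross_file_context", 0) > 5:
        return "nemotron-3-super"
    return "nemotron-3-nano"

print(pick_model({"cross_file_context": 12}))  # -> nemotron-3-super
print(pick_model({}))                          # -> nemotron-3-nano
```

The point of the pattern is economic: most agent turns are cheap coordination steps, so the expensive reasoning engine is only invoked when the task actually needs it.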

The $26 billion filing

This number did not come from a press release. Wired’s Will Knight surfaced it from Nvidia’s 2025 SEC filings and confirmed it in interviews with Nvidia executives. The figure: $26 billion over the next five years earmarked for training open-weight AI models.

Nvidia has been releasing models since 2023 when it launched the first Nemotron. The $26 billion commitment signals this is not a side project. Bryan Catanzaro, who has led Nvidia’s applied deep learning research since 2011 and helped architect the company’s shift from gaming graphics to AI infrastructure, told Wired: “Nvidia is taking open model development much more seriously. And we are making a lot of progress.” He also confirmed Nvidia has finished pretraining a 550B-parameter model that hasn’t been released yet.

The strategic logic from Nvidia’s side is clear. Kari Briski, VP of generative AI software for enterprise, described it plainly: “We build it to stretch our systems and test not just the compute but also the storage and networking, and to build out our hardware architecture roadmap.” Training frontier models at this scale stress-tests Nvidia’s own hardware in ways that customer deployments cannot. The models are a testing vehicle as much as a product.

The customer conflict

Nvidia’s biggest customers are OpenAI, Google, Anthropic, and Meta. All of them buy Nvidia GPUs to train models. Nvidia is now also training models.

The tension is real but bounded. Nvidia is not building a ChatGPT competitor. They’re releasing weights on Hugging Face. Their models target enterprise deployment, self-hosted infrastructure, and agentic frameworks – not consumer products or frontier API services. In practice, Nvidia’s Nemotron and OpenAI’s API products serve different use cases and different buyers.

But the directional signal is unambiguous. Nvidia wants a presence at every layer of the AI stack: the chips, the software libraries (CUDA, TensorRT, NeMo), and now the models themselves. The closest structural analogy is a chipmaker that also provides the reference implementations everyone builds on – except the scale here is $26 billion and the output is frontier-capable open-weight models.

The customer conflict is also self-limiting in a specific way. Nvidia’s models being open-weight means they don’t compete for API revenue – they compete for mindshare and deployment patterns. An engineer who builds a production system on Nemotron 3 Super is likely running it on Nvidia hardware. The model is the trojan horse for the inference stack.

Why open-weight specifically

Nvidia is not releasing closed models. They’re releasing weights, datasets, recipes, and training infrastructure. The strategic logic becomes clear when you look at what the alternative would mean.

The best US models from OpenAI, Anthropic, and Google are accessible only through cloud APIs. The weights for top Chinese models – DeepSeek, Qwen, models from Alibaba, Moonshot AI, Z.ai, MiniMax – are released openly. Many startups and researchers worldwide are currently building on Chinese open-weight models because those are the ones they can actually run, modify, and deploy privately. Nvidia releasing competitive open-weight models changes which foundational models get built on – and those models run optimally on Nvidia inference infrastructure.

Open-weight models that are optimized for Nvidia’s hardware create a defensible moat through the software layer. The NVFP4 pretraining is a concrete example: Super runs fastest on Nvidia Blackwell GPUs because it was designed and trained for that architecture from the start. You can run it elsewhere, but you’ll pay a performance tax. This is the same pattern playing out across self-hosted AI deployments – the hardware and the model becoming co-optimized to the point where switching one means reconsidering the other.

Nathan Lambert, an AI researcher at the Allen Institute for AI (Ai2) who leads the ATOM Project (American Truly Open Models), said it directly: “I’m a huge Nemotron fan.” Lambert is pushing for US government funding of open models as a sovereignty play – the argument being that allowing open-weight AI to remain dominated by Chinese labs creates long-term strategic risk. Nvidia’s $26 billion investment fits that framing and provides a counter-narrative that doesn’t require government intervention.

What this does to the open-weight ecosystem

The competitive landscape for open-weight models has been DeepSeek, Qwen, LLaMA, and Mistral. Meta’s LLaMA has been the de facto US open-weight standard, but Mark Zuckerberg signaled last year that Meta might not open-source all of its future models. OpenAI has GPT-OSS, but it trails the company’s proprietary frontier by a significant margin.

Nvidia entering with $26 billion in committed compute changes the cost structure. Nvidia can train on its own hardware at cost – no GPU rental, no cloud markup, no access constraints. They have a 550B-parameter model already pretrained that hasn’t been released yet. The compute advantage is structural, not temporary.

The $26 billion also signals duration. A five-year commitment at this scale is not an experiment. It’s a decision that Nvidia intends to be a sustained, major presence in open-weight model development for the foreseeable future.

For engineers building systems on open-weight models today, the practical implication is that Nvidia is becoming a serious first-party option rather than an interesting alternative. Nemotron 3 Super outperforms GPT-4o-mini on throughput benchmarks and ranks first on PinchBench (a benchmark measuring model performance as an agent reasoning engine). Nathan Lambert’s endorsement from outside the vendor ecosystem matters here – it’s validation from someone whose job is to evaluate these models critically, not promote them.

The training recipe release is also significant for organizations that want to build on top of rather than just deploy Nemotron. Having the full pretraining and RL pipeline available means domain-specific variants are feasible for well-resourced teams, not just Nvidia’s internal researchers.

What it looks like from here

Nvidia invented the infrastructure layer for the AI era. $26 billion suggests they intend to be a serious player in the model layer too.

The hardware monopoly concern has existed since 2020 – the question of what happens to AI development when one company controls access to the chips required to train and run frontier models. Nvidia training frontier models doesn’t straightforwardly resolve that concern. It adds a new dimension to it.

The open-weight strategy is genuinely good for the ecosystem in the near term. Better open models, more competition with Chinese labs, a credible US-made alternative that organizations can self-host. The long-term question is what the field looks like when the company that built the infrastructure also trains the reference models that run best on that infrastructure – and has $26 billion committed to maintaining that position.

That’s not a reason to avoid Nemotron. Super is an impressive model with an unusually transparent training pipeline, available now, running on half a dozen inference platforms. For teams building multi-agent systems, it’s worth evaluating seriously.

But the $26 billion is the more important story. Nvidia is no longer a hardware company that dabbles in software. They’re a full-stack AI player with the compute advantage to compete at every layer simultaneously. The question isn’t whether they can – they demonstrably can. The question is what “winning” at the model layer means for a company that already owns the hardware layer beneath it.