Specs verified against NVIDIA newsroom announcements, March 16, 2026. Some per-GPU specs are NVIDIA-stated figures; independent verification pending.
What’s New This Week
Initial publication – NVIDIA announced the Vera Rubin platform at GTC 2026 keynote on March 16, 2026. Post published same evening.
Changelog
| Date | Summary |
|---|---|
| 16 Mar 2026 | Initial publication on GTC 2026 announcement day. |
NVIDIA announced Vera Rubin tonight at GTC 2026. The headline specs are impressive: seven new chips in full production, five rack-scale systems, up to 10x higher inference throughput per watt over Blackwell at one-tenth the cost per token. The NVL144 CPX variant packs 8 exaflops of AI compute and 100TB of fast memory into a single rack.
The number that matters most isn’t the exaflop count. It’s the last part of that claim: one-tenth the cost per token.
That number changes what you can afford to build.
What Vera Rubin Actually Is
The Vera Rubin platform is a full-stack redesign. Seven chips, all in production: the Rubin GPU, the Vera CPU, NVLink 6 Switch, ConnectX-9 SuperNIC, BlueField-4 DPU, Spectrum-6 Ethernet switch, and the newly integrated Groq 3 LPU. Five rack configurations built from those chips: the NVL72 GPU rack, Vera CPU rack, Groq 3 LPX inference rack, BlueField-4 STX storage rack, and Spectrum-6 Ethernet rack.
The flagship NVL72 integrates 72 Rubin GPUs with 36 Vera CPUs, connected by NVLink 6. NVIDIA’s claim for this system: up to 10x higher inference throughput per watt and one-tenth the cost per token compared with the Blackwell platform.
Ships H2 2026. Not available yet.
There’s also the Rubin CPX variant, announced separately at NVIDIA’s AI Infra Summit. The Vera Rubin NVL144 CPX packs 8 exaflops of AI compute alongside 100TB of fast memory and 1.7 petabytes per second of memory bandwidth in a single rack – 7.5x more AI performance than the GB300 NVL72. The CPX GPU itself delivers 30 PFLOPS at NVFP4 precision with 128GB of GDDR7 memory and 3x faster attention processing than current GB300 systems. Available end of 2026.
These are NVIDIA-stated numbers. The independent benchmarks will follow once hardware ships. But the architecture decisions behind these numbers are legible and worth examining now.
What 10x Cheaper Tokens Actually Unlocks
Every LLM API call has a cost. At current Blackwell-era prices, some use cases are economically marginal – not impossible, just expensive enough that the ROI math doesn’t work cleanly.
Consider the categories that sit at that margin today:
Long-context document analysis. Running a 100K-token context through a model multiple times per document, at scale, is expensive enough to make many use cases feel like engineering excess rather than good product design. At one-tenth the cost, the same analysis pipeline becomes obviously worth building.
High-frequency agent calls. Agentic systems make many small inferences – tool selection, result evaluation, next-step planning. At current prices, keeping those calls cheap means constraining the model, limiting context, or batching aggressively. At 10x lower cost, you can be less clever about it and more generous with context.
Real-time inference at consumer scale. The difference between a feature that runs on-demand versus one that runs in the background, proactively, for every user – that’s often a cost decision dressed up as a product decision. Token costs set the ceiling on what you can run at consumer scale without burning your margin.
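To make the margin concrete, here is a back-of-envelope sketch of how a 10x price drop changes the daily bill for a long-context document pipeline. All prices and volumes are hypothetical placeholders, not NVIDIA or any vendor's pricing:

```python
# Back-of-envelope: how a 10x drop in token price changes feasibility.
# Every figure here is an assumed placeholder, not published pricing.

def pipeline_cost(docs_per_day, tokens_per_doc, passes, price_per_mtok):
    """Daily inference cost (USD) for a document-analysis pipeline."""
    tokens = docs_per_day * tokens_per_doc * passes
    return tokens / 1_000_000 * price_per_mtok

# 10,000 docs/day, 100K-token contexts, 3 passes per document:
today = pipeline_cost(10_000, 100_000, 3, price_per_mtok=3.00)  # assumed $3/1M tok
rubin = pipeline_cost(10_000, 100_000, 3, price_per_mtok=0.30)  # one-tenth

print(f"today: ${today:,.0f}/day")  # $9,000/day
print(f"rubin: ${rubin:,.0f}/day")  # $900/day
```

At $9,000/day the pipeline is a budget line item someone has to defend; at $900/day it is a rounding error. Same product, same model quality, different decision.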
None of this is hypothetical. It’s the same dynamic that played out with storage costs, compute costs, and bandwidth costs over the last three decades. When the cost of a resource drops by an order of magnitude, the applications that were previously too expensive to build become the obvious next wave.
The question isn’t whether 10x cheaper tokens will change what gets built. It will. The question is whether you’ve designed your current architecture to take advantage of it when it arrives – or whether you’ve over-optimised for today’s cost constraints in ways that will require rework.
The Memory Story: HBM4 and the KV Cache Bottleneck
The demand for high-bandwidth memory has been building for years. Vera Rubin is where that investment starts paying inference dividends.
The Rubin GPU is equipped with HBM4 memory at 22 TB/s of memory bandwidth per GPU – a substantial step up from the HBM3e in the current Blackwell generation. This isn’t just about raw speed. It’s targeted at a specific bottleneck: the KV cache.
In transformer inference, the key-value cache grows linearly with context length. Serving a 1M-token context window – now generally available in frontier models such as Claude – means holding an enormous amount of state in memory and reading it on every generated token. That makes long-context inference memory-bandwidth-bound, not compute-bound. Throwing more FLOPS at it doesn’t help if you’re waiting on memory.
HBM4 at 22 TB/s directly addresses this. More bandwidth means you can serve longer contexts at higher throughput before memory becomes the constraint. The 10x cost-per-token claim isn’t separate from the memory story – it’s partly a consequence of it. Serving the same context window requires fewer GPUs when each GPU can handle more KV cache traffic.
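A rough sizing sketch makes the bandwidth-bound argument tangible. The model shape below is a hypothetical GQA configuration (80 layers, 8 KV heads, head dimension 128, fp16), not any specific model's published architecture, and the throughput ceilings ignore batching and real-world overheads:

```python
# Rough KV-cache sizing for long-context decoding.
# Model shape is an assumed GQA config, not a specific model's spec.

def kv_cache_bytes(ctx_tokens, layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 tensors (K and V) per layer, per token
    return 2 * layers * kv_heads * head_dim * dtype_bytes * ctx_tokens

cache = kv_cache_bytes(1_000_000)         # 1M-token context
print(f"KV cache: {cache / 1e9:.0f} GB")  # ~328 GB for this config

# Decoding reads the whole cache once per generated token, so memory
# bandwidth caps single-stream throughput regardless of FLOPS.
for bw_tbs in (8, 22):  # roughly HBM3e-class vs. the quoted HBM4 figure
    print(f"{bw_tbs} TB/s -> {bw_tbs * 1e12 / cache:.0f} tok/s ceiling")
```

Under these assumptions, going from 8 to 22 TB/s nearly triples the per-stream decode ceiling on a 1M-token context, which is exactly where the cost-per-token improvement comes from: the same context served by fewer GPUs.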
For engineers building on top of large-context models, this matters. The cost curves for applications that depend on long-context inference – code understanding at repository scale, document analysis, long-running agent state – will look different in 2027 than they do today.
The CPX Variant: 100TB Per Rack for Massive-Context Inference
The Rubin CPX is a distinct product with a specific purpose. NVIDIA describes it as “a new class of GPU purpose-built for massive-context processing.” That’s not marketing language – the hardware choices reflect it.
Where the standard Rubin GPU uses HBM4 for high bandwidth, the CPX uses GDDR7 at higher capacity: 128GB per GPU, optimised for the access patterns of long-context attention rather than raw bandwidth. The NVL144 CPX configuration combines standard Rubin GPUs with CPX processors in a single rack, giving you 100TB of fast memory total and 1.7 PB/s of aggregate bandwidth.
That 100TB figure is significant. It means you can hold the KV cache for very long context windows – millions of tokens – without spilling to slower storage or re-computing attention. NVIDIA is quoting 3x faster attention processing than GB300 NVL72 systems.
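To put 100TB in context, here is what it buys in concurrent long-context sessions. The per-token KV footprint again assumes a hypothetical GQA model (80 layers, 8 KV heads, head dim 128, fp16); real models and quantised caches will differ:

```python
# Capacity check: concurrent 1M-token KV caches in 100 TB of fast memory.
# Per-token footprint assumes a hypothetical GQA config, not a published spec.

BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # K + V, per layer, fp16 -> 327,680 B

rack_bytes = 100e12   # 100 TB per NVL144 CPX rack (NVIDIA-stated)
ctx = 1_000_000       # 1M-token context per session

sessions = rack_bytes / (BYTES_PER_TOKEN * ctx)
print(f"~{sessions:.0f} concurrent 1M-token KV caches per rack")  # ~305
```

Roughly three hundred simultaneous million-token sessions held entirely in fast memory, per rack, with no spill to storage and no attention recompute – that is the workload class CPX is shaped for.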
The companies already working with CPX specs are instructive: Cursor (million-token code context), Runway (long-form video generation, up to 1M tokens per hour of content), Magic (100M-token context for autonomous software engineering). These are all applications where the constraint isn’t model quality – it’s the cost and latency of processing enough context to make the model useful. CPX is the hardware designed to break that constraint.
NVIDIA is also quoting a monetisation number: $5 billion in token revenue for every $100 million invested in Rubin CPX infrastructure. That’s a 50x revenue-to-capex ratio. The number is aggressive and needs real-world validation, but the direction is clear – NVIDIA is selling this as an infrastructure investment with a calculable return, not just a performance upgrade.
The Custom Vera CPU: NVIDIA Becomes a Full-Stack Compute Company
This is the change that doesn’t get enough attention in the spec sheets.
Previous NVIDIA GPU racks ran AMD or Intel CPUs. The Vera Rubin platform ships with NVIDIA’s own Vera CPU – ARM-based, designed specifically for agentic AI workloads. The Vera CPU rack integrates 256 Vera CPUs in a liquid-cooled MGX chassis, delivering what NVIDIA claims is 2x better efficiency and 50% higher performance than traditional CPUs at equivalent tasks.
Building your own CPU is a significant investment and a significant statement. It means NVIDIA now owns the entire data path: CPU, GPU, NVLink interconnect, DPU, NIC, and switch fabric. Every component is designed to work together, with NVIDIA controlling the codesign tradeoffs at every layer.
The practical implications for inference workloads: the CPU handles orchestration, pre-processing, and tool execution in agentic pipelines. When the CPU-GPU interconnect is designed by the same team that designed both chips, you eliminate a class of bottlenecks that exist in heterogeneous stacks. Data doesn’t sit at the boundary waiting – it moves according to a unified memory and scheduling model.
This also puts NVIDIA in direct competition with AWS Graviton, AMD EPYC, and Intel Xeon in the data centre CPU market. Those are strong products from well-resourced teams. But NVIDIA’s advantage is vertical integration, not raw CPU performance. A Vera CPU that’s merely competitive on single-threaded performance but dramatically better at feeding Rubin GPUs is a win for the workloads that matter.
Jensen Huang positioned this explicitly at the keynote: “extreme codesign” is the core claim – software and silicon designed in tandem, the full stack optimised as a single system rather than a collection of best-of-breed components bolted together.
The trajectory is clear. NVIDIA is no longer a GPU company that sells into someone else’s infrastructure. It’s building the infrastructure.
Timeline Reality: Plan Now, Build Later
Vera Rubin ships H2 2026. The CPX variant ships end of 2026. Cloud providers will start taking delivery late 2026; realistically, developers will have access to Vera Rubin cloud instances in early 2027.
That’s not a reason to wait. It’s a reason to plan.
Architecture decisions made today – which model generation to build on, what token cost assumptions to bake into your unit economics, how aggressively to optimise for current cost constraints – all carry forward. Systems designed around $X per million tokens don’t automatically become more efficient when the market price drops to $X/10. If you’ve over-engineered your prompt compression and batching logic to compensate for expensive inference, you’ll spend time unwinding that when cheap inference arrives.
The useful framing: what use cases are you not building today because the inference economics don’t work? Make a list. Some of those will become viable in 2027 when Vera Rubin cloud pricing settles. Getting the product design right now – even before the hardware ships – means you’re ready to move when the cost curve shifts.
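One way to build that list is to compute the token price at which each shelved feature becomes viable, then see which ones clear a 10x price drop. The figures below are hypothetical unit-economics placeholders:

```python
# Sketch: the highest token price at which a per-user feature stays in budget.
# Budget and volume figures are assumed placeholders, not real pricing.

def max_price_per_mtok(budget_per_user, tokens_per_user_per_month):
    """Highest $/1M-token price that keeps the feature within budget."""
    return budget_per_user / (tokens_per_user_per_month / 1_000_000)

# A proactive background feature burning 5M tokens/user/month
# against a $1/user/month inference budget:
ceiling = max_price_per_mtok(budget_per_user=1.00,
                             tokens_per_user_per_month=5_000_000)
print(f"viable below ${ceiling:.2f} per 1M tokens")  # $0.20
```

A $0.20/1M-token ceiling is out of reach at today’s assumed prices but comfortably inside a 10x reduction – that feature belongs on the 2027 list, designed now.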
NVIDIA also previewed Feynman, the generation after Vera Rubin. A new CPU (Rosa, named for Rosalind Franklin), a new LPU (LP40), new networking. If the cadence holds at roughly annual, Feynman is late 2027. The hardware trajectory isn’t slowing down.
The performance numbers – 8 exaflops, 7.5x vs Blackwell, 30 PFLOPS per CPX GPU – matter for the hyperscalers and data centre operators doing procurement planning this quarter.
The cost number – one-tenth the cost per inference token – matters for every engineer deciding what to build next.
Those are different audiences, but the second one is larger. The hardware is the supply side. The applications built on top of it are the demand side, and that demand hasn’t fully shown up yet because the economics haven’t justified it. When the cost floor drops by an order of magnitude, the demand picture changes.
The question worth sitting with: what are you not building today that you should start designing now?