Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New This Week

22 March 2026: Added model compatibility section – which open-weight models (Llama 4 Scout, Qwen 3 72B, DeepSeek R1 32B, Gemma 3 27B, Llama 4 Maverick) each hardware tier can actually run, with VRAM requirements and quantisation notes.


Changelog

| Date | Summary |
| --- | --- |
| 22 Mar 2026 | Added per-tier model compatibility guide covering Llama 4 Scout, Qwen 3 72B, DeepSeek R1, Gemma 3 27B, and Llama 4 Maverick across all hardware options. |
| 22 Mar 2026 | Initial publication comparing Tinybox, Apple Silicon, and Project Digits as local AI inference options. |

Local AI inference is no longer niche. The question for engineering teams in 2026 is not whether to run models locally – it’s which hardware philosophy fits your workflow, budget, and tolerance for operational complexity.

There are three serious contenders right now: the Tinybox family from tinygrad, Apple Silicon (Mac Mini and Mac Studio), and Nvidia’s Project Digits. They are not really competing for the same customer. Understanding which one you actually need is the whole point of this piece.

This connects to a broader trend covered in self-hosting your AI stack and the local inference moment we are currently in. The hardware has caught up with the ambition.

The Three Philosophies

Before getting into specs, it is worth naming what each option is actually optimised for.

Tinybox: Give me raw GPU VRAM and I will handle everything else. George Hotz and the tinygrad team built these for researchers who want maximum memory, are comfortable on Linux, and are willing to work within the tinygrad framework. These are not plug-and-play devices.

Apple Silicon: Give me a system that just works. Unified memory eliminates the VRAM bottleneck that makes discrete GPU setups awkward for large models. The software ecosystem (Ollama, LM Studio, MLX, llama.cpp) is excellent and improving fast. Mac hardware runs quietly in an office.

Project Digits: Give me the Nvidia stack in a box. CUDA compatibility, TensorRT, the full Nvidia AI software stack preloaded on a compact desktop. The best option for teams already invested in the Nvidia ecosystem who want to move inference local without touching their software stack.

Tinybox: Specs and Tradeoffs

There are four variants. They differ substantially in cost and capability.

Tinybox red runs six AMD Radeon RX 7900XTX GPUs, giving 96GB total VRAM in a 12U rack unit. Price: $15,000. If your team runs AMD tooling or wants to avoid Nvidia licensing, this is the entry point.

Tinybox green swaps in six RTX 4090s for 144GB total VRAM. Price: $25,000. The green v2 replaces the 4090s with four RTX 5090 GPUs – the newest variant per the tinygrad docs.

Tinybox pro is the serious option: eight RTX 4090s, 192GB VRAM, 1.36 petaFLOPS FP16, two AMD Genoa EPYC processors. Rack-mount, intentionally loud, built for real workloads. Price: $40,000.

The software is tinygrad – open source, Linux-only, actively developed but not a general-purpose framework. If your team is not already comfortable with tinygrad or comfortable learning it, this adds real operational overhead.

The honest assessment: Tinybox makes sense for research teams that need a large contiguous GPU memory pool and want a single integrated rack unit running open-source software. It makes less sense if you need CUDA compatibility or a quiet office-friendly setup.

Apple Silicon: The Unified Memory Advantage

The architectural story here is different. Apple Silicon puts CPU and GPU on the same chip sharing the same memory pool. There is no VRAM bottleneck because there is no separate VRAM – just one unified memory pool with high bandwidth.

Mac Mini M4 at 24GB costs $599. It handles 7B to 13B parameter models comfortably. It draws roughly 20 to 30 watts under load. For individuals or teams that want to experiment with local inference without serious budget commitment, it is the obvious starting point.

Mac Mini M4 at 128GB costs approximately $3,199. At that memory size, 100B+ parameter models become viable with quantisation. That is a significant amount of inference capability for the price.

Mac Studio M4 Max starts around $1,999 and scales to 128GB unified memory with higher memory bandwidth than the Mini.

Mac Studio M3 Ultra tops out at 192GB unified memory – the highest in any single Apple consumer unit.

The killer data point is the cluster story. Four Mac Studios linked over Thunderbolt 5 RDMA gives 1.5TB of unified memory across the cluster. A team ran Kimi K2 on this setup at 25 tokens per second. Total cost: approximately $40,000. An equivalent Nvidia H100 setup would run roughly $780,000. That is not a rounding error – it is a different order of magnitude, and it makes the case for Apple Silicon clusters in a way that the per-unit specs do not fully capture.

The software ecosystem is genuinely excellent. MLX (Apple’s own ML framework) is maturing fast. Ollama and LM Studio both run well on Apple Silicon. llama.cpp has first-class support. The main gap is CUDA – if your team’s tooling depends on CUDA, Apple Silicon is not the right fit.

Project Digits: Best Value Per Petaflop

Project Digits was announced in January 2025 and shipped in May 2025. Single unit: Nvidia’s GB10 Superchip (Grace CPU plus Blackwell GPU), 128GB LPDDR5X unified memory, 4TB NVMe SSD, approximately 1 petaFLOP FP16. Price: $3,000.

Two units linked together give 256GB unified memory and the ability to run 405B parameter models. Cost for two: $6,000.
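The 405B figure is consistent with the usual quantisation arithmetic. A quick sketch, assuming roughly 4.8 effective bits per weight for a typical Q4 GGUF (the exact figure varies by quant variant):

```python
params_b = 405          # a 405B-parameter model, e.g. Llama 3.1 405B-class
bits = 4.8              # assumed effective bits/weight for a typical Q4 GGUF
weights_gb = params_b * bits / 8
print(weights_gb)       # 243.0 GB of weights, leaving ~13GB of the
                        # 256GB pool for KV cache and runtime buffers
```

It fits, but without much headroom – long context windows on a 405B model would press against the remaining margin.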

The full Nvidia AI stack comes preloaded – CUDA, TensorRT, everything. For teams already on CUDA, this is the path of least resistance to local inference. No stack changes, no framework migrations.

The value comparison with Tinybox pro is worth stating directly. At $40,000, the Tinybox pro delivers 1.36 petaFLOPS and 192GB VRAM. For $6,000, two Project Digits units deliver approximately 2 petaFLOPS and 256GB unified memory. On raw specs per dollar, it is not close.

What does Tinybox pro offer that Digits does not? The tinygrad framework and open-source software stack, AMD or Nvidia GPU choice depending on variant, and a single integrated rack unit. For some teams those things matter. But the gap is worth naming honestly – if your primary concern is inference throughput per dollar, Digits wins at nearly every price point.

The Honest Comparison

| | Tinybox red | Tinybox pro | Mac Mini M4 (128GB) | Mac Studio cluster (x4) | Project Digits (x2) |
| --- | --- | --- | --- | --- | --- |
| Price | $15,000 | $40,000 | $3,199 | ~$40,000 | $6,000 |
| Memory | 96GB VRAM | 192GB VRAM | 128GB unified | 1.5TB unified | 256GB unified |
| Compute | – | 1.36 PFLOPS FP16 | – | – | ~2 PFLOPS FP16 |
| Software | tinygrad / Linux | tinygrad / Linux | MLX, Ollama, etc. | MLX, Ollama, etc. | Full Nvidia stack |
| Noise | Loud | Very loud | Silent | Silent | Quiet |
| CUDA | No | No | No | No | Yes |

What Can Each Platform Actually Run?

Specs tables tell you memory numbers. This section tells you what that memory actually buys you in terms of models you can run today.

The benchmark model to think against is Qwen 3 72B. At Q4 quantisation it needs roughly 48GB of VRAM or unified memory. It is a genuinely capable model – strong at reasoning and coding – and it represents what “serious local inference” looks like in early 2026. If your hardware runs Qwen 3 72B Q4 comfortably, you have a real setup. If it does not, you are in the tier below.
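Where does the ~48GB figure come from? A minimal sketch of the standard rule of thumb – weight bytes plus a margin for runtime buffers – assuming common Q4 GGUF variants average around 4.8 effective bits per weight (both the bits figure and the 10% overhead are assumptions, not exact numbers):

```python
def quantized_size_gb(params_billion: float, bits_per_weight: float,
                      overhead: float = 1.1) -> float:
    """Rough memory estimate for a quantized model: weight bytes plus
    ~10% for runtime buffers. KV cache for long contexts is extra."""
    return params_billion * bits_per_weight / 8 * overhead

# Qwen 3 72B at Q4, assuming ~4.8 effective bits/weight
print(round(quantized_size_gb(72, 4.8)))  # prints 48
```

The same formula gives a first-pass answer for any model in this article: plug in total parameters and the quant's effective bit width.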

The 24GB tier: more capable than it looks

Mac Mini M4 with 24GB is the entry point. At first glance, 24GB seems limiting. In practice, the arrival of MoE (mixture-of-experts) architectures makes this tier significantly more useful than the raw number suggests.

Gemma 3 27B (Google, 27B parameters) needs roughly 16-17GB at Q4 – it fits on the 24GB Mac Mini with room to spare. DeepSeek R1 32B and Qwen 3 32B both need around 20GB at Q4, and they both fit.

The interesting one is Llama 4 Scout (Meta, 2025). It has 109B total parameters but only activates 17B per token due to its MoE architecture. At Unsloth’s 1.78-bit quantisation, it squeezes into 24GB. Tight, but it works. A 109B MoE model on a $599 machine is not something that was possible eighteen months ago.
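The arithmetic behind "tight, but it works" is worth seeing. One caveat: Unsloth's "1.78-bit" is a dynamic scheme (some layers kept at higher precision, most much lower), so treating 1.78 as a flat average is an approximation:

```python
total_params_b = 109    # Llama 4 Scout: total parameters (MoE)
active_params_b = 17    # parameters activated per token

# Memory is governed by TOTAL parameters -- every expert must be
# resident. MoE saves compute per token, not RAM.
bits = 1.78             # Unsloth dynamic quant, nominal average
weights_gb = total_params_b * bits / 8
print(round(weights_gb, 1))  # ~24.3 -- hence "tight" on a 24GB machine
```

The point generalises: MoE models let a 24GB machine run a model whose quality reflects 109B parameters while paying per-token compute closer to a 17B model.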

What the 24GB tier cannot run: Qwen 3 72B (Q4 needs ~48GB), and anything larger.

  • ✅ Gemma 3 27B (Q4, ~16-17GB – comfortable, fits with room for context)
  • ✅ Qwen 3 32B (Q4, ~20GB)
  • ✅ DeepSeek R1 32B (Q4, ~20GB – good reasoning model for the tier)
  • ✅ Llama 4 Scout (Unsloth 1.78-bit, ~24GB – tight but works)
  • ❌ Qwen 3 72B (Q4 needs ~48GB – doesn’t fit)
  • ❌ Llama 4 Maverick (too large for this tier)
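The tier logic above reduces to a simple filter. A hypothetical helper using the Q4 footprints quoted in this article (the 2GB reserve for OS and context is an assumption, and the model names are informal labels, not registry tags):

```python
# Q4 footprints quoted in this article (GB)
MODELS_GB = {
    "gemma3-27b": 17,
    "qwen3-32b": 20,
    "deepseek-r1-32b": 20,
    "qwen3-72b": 48,
}

def fits(tier_gb: int, reserve_gb: int = 2) -> list[str]:
    """Models that fit a tier, reserving a little memory for OS/context."""
    return [m for m, gb in MODELS_GB.items() if gb + reserve_gb <= tier_gb]

print(fits(24))   # ['gemma3-27b', 'qwen3-32b', 'deepseek-r1-32b']
```

Running `fits(128)` adds qwen3-72b to the list, which is exactly the jump the next tier buys you.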

The 128GB sweet spot

Mac Mini M4 at 128GB, Mac Studio M4 Max at 128GB, and a single Project Digits unit all land in the same capability tier. This is where Qwen 3 72B Q4 runs comfortably, with headroom for longer context windows. Llama 4 Scout Q4 (needing ~55GB) fits here too. You can run two simultaneous Qwen 3 32B instances if your workload benefits from parallel inference.
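"Headroom for longer context windows" is quantifiable. KV cache grows linearly with context length; a sketch using assumed 72B-class geometry (80 layers, grouped-query attention with 8 KV heads, head dimension 128 – illustrative numbers, not confirmed Qwen 3 internals):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_val: int = 2) -> float:
    """KV cache size: two tensors (K and V) per layer, fp16 values."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_val / 1e9

# Assumed 72B-class geometry: 80 layers, GQA with 8 KV heads, head_dim 128
print(round(kv_cache_gb(80, 8, 128, 32_768), 1))  # ~10.7GB for a 32k context
```

With ~48GB of weights plus ~11GB of KV cache at 32k context, a 128GB machine still has real margin; a 64GB machine would not.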

Project Digits brings full CUDA and TensorRT to this tier, which matters if your tooling depends on it. The Mac options bring silence, the Apple software ecosystem, and in the Studio’s case, higher memory bandwidth. The model capability is the same either way.

What this tier cannot run: Llama 4 Maverick at Q4 needs around 294GB – it is simply too large for a single 128GB machine at any reasonable quantisation.

  • ✅ Qwen 3 72B (Q4, ~48GB – the recommended default for this tier)
  • ✅ Llama 4 Scout (Q4, ~55GB – fits with good headroom)
  • ✅ DeepSeek R1 70B (Q4, ~48GB)
  • ✅ Llama 3.3 70B (Q4, ~48GB)
  • ✅ Two simultaneous Qwen 3 32B instances
  • ❌ Llama 4 Maverick Q4 (needs ~294GB)

The 192GB tier: Maverick unlocked

Mac Studio M3 Ultra (192GB unified memory) and Tinybox pro (192GB VRAM across eight RTX 4090s) both hit the threshold where Llama 4 Maverick becomes viable. Unsloth’s 1.78-bit quantisation brings Maverick (400B+ total parameters, MoE architecture) down to roughly 96GB – which fits at 192GB with room to spare.

Maverick at ultra-low-bit is a meaningful step up in capability from the 128GB tier. Whether that quality gap justifies the hardware cost depends on your use case, but the option is there where it was not before.

One architecture note specific to Tinybox: the VRAM is distributed across multiple discrete GPUs, not a single contiguous pool. The red has 96GB across six GPUs; the green 144GB across six; the pro 192GB across eight. Models need to support tensor parallelism to use the full pool. llama.cpp handles this, and Ollama handles it too – but it is worth confirming your tooling does before assuming the full memory is available to a single model.
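The mechanics of using a distributed pool can be sketched as layer partitioning: assign each GPU a contiguous slice of transformer layers proportional to its memory, which is the same idea behind llama.cpp's tensor-split option. A hypothetical helper:

```python
def split_layers(n_layers: int, gpu_mem_gb: list[float]) -> list[int]:
    """Assign transformer layers to GPUs proportionally to their memory,
    the same idea as llama.cpp's --tensor-split option."""
    total = sum(gpu_mem_gb)
    shares = [round(n_layers * m / total) for m in gpu_mem_gb]
    shares[-1] += n_layers - sum(shares)  # absorb rounding drift
    return shares

# Tinybox pro: eight 24GB RTX 4090s, an 80-layer model splits evenly
print(split_layers(80, [24.0] * 8))  # [10, 10, 10, 10, 10, 10, 10, 10]
```

The cost of this scheme is inter-GPU traffic at each slice boundary, which is why interconnect bandwidth matters more on Tinybox than on a unified-memory machine.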

  • ✅ Everything in the 128GB tier, comfortably
  • ✅ Llama 4 Scout near full precision
  • ✅ Llama 4 Maverick ultra-low-bit (Unsloth 1.78-bit, ~96GB – fits with headroom)
  • ✅ Qwen 3 72B near FP16 (higher quality than Q4)
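The tier's two headline claims check out with the same back-of-envelope arithmetic used throughout (treating Unsloth's dynamic 1.78-bit as a flat average is an approximation):

```python
# Llama 4 Maverick at Unsloth's nominal 1.78-bit average (400B+ total params)
maverick_gb = 400 * 1.78 / 8      # 89.0 -- the article's ~96GB adds buffers
# Qwen 3 72B near FP16 (16 bits/weight)
qwen_fp16_gb = 72 * 16 / 8        # 144.0 -- fits in 192GB, not in 128GB
print(maverick_gb, qwen_fp16_gb)
```

Both land under 192GB with margin, and neither fits the tier below – which is precisely what makes 192GB a distinct capability threshold rather than just a bigger number.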

Two Project Digits linked: the value anomaly

Two Project Digits units linked together give 256GB unified memory for $6,000. That puts them in the same capability tier as the Tinybox pro at $40,000 – Llama 4 Maverick ultra-low-bit runs comfortably, Llama 4 Scout runs near full precision, Qwen 3 72B at near FP16 is possible. The full CUDA stack is included.

The tradeoff: two separate physical units rather than one integrated rack system, and Digits lacks the open-source tinygrad story that some research teams value. But if your primary concern is model capability per dollar, two Digits at $6,000 running everything the $40,000 Tinybox pro runs is the most interesting value proposition in this comparison. Approaching Tinybox pro capability at roughly one-seventh the price is not a marginal difference.

Decision Framework

If your team is doing training or large batch inference and is comfortable with Linux and open-source tooling, Tinybox is worth evaluating. The contiguous VRAM pool is genuinely useful for certain workloads. The tinygrad framework has a real community behind it.

If your team wants local inference with minimal ops burden – quiet hardware, great software ecosystem, privacy – Apple Silicon is the default recommendation. The cluster scaling story is underappreciated. A four-Mac-Studio cluster for $40K doing work that would cost $780K in H100s is a real engineering argument, not a marketing claim.

If your team is already on CUDA and wants to move inference local without changing anything in your software stack, Project Digits is the obvious choice. Best value per petaFLOP. Full Nvidia stack. Compact enough to sit on a desk.

The market for local AI hardware is moving fast. What is notable is that two of the three compelling options – Apple Silicon clusters and Project Digits – did not exist in their current form eighteen months ago. The hardware has caught up to the demand.

Pick based on your software stack and your tolerance for operational complexity. The specs are all good enough. The philosophy is the differentiator.