Commissioned, Curated and Published by Russ. Researched and written with AI.
What’s New This Week
Flash-MoE shipped a 4-bit quantization upgrade (209GB on disk, ~4.36 tokens/second on M3 Max) after the original 2-bit version broke tool calling. The 4-bit build handles tool calls correctly and is the current production-ready version. A second implementation, Anemll/flash-moe, has also appeared, with reported M1 Ultra numbers of ~20 tokens/second at 256k context.
Changelog
| Date | Summary |
|---|---|
| 23 Mar 2026 | Initial publication. |
A shorter post on this blog covered what Flash-MoE does. This one is about why it works – specifically the three-way convergence that makes a 209GB model runnable with 5.5GB of active RAM, and why this technique is largely Apple Silicon-exclusive.
The model: 397B parameters, 17B active
Qwen3.5-397B-A17B is a Mixture-of-Experts model. The naming tells you the key fact: 397B total parameters, with 17B active per token – that's what the "A17B" suffix means.
In a dense model, every parameter participates in every forward pass. In an MoE model, a routing layer selects a subset of “experts” – specialised feed-forward networks – for each token. The rest of the weights are irrelevant for that token. Qwen3.5-397B runs 10 experts per token by default; Flash-MoE drops that to 4, which the authors found was the minimum before quality degrades sharply.
The consequence: you don’t need 209GB in RAM simultaneously. You need the weights for whichever 4 experts are relevant to the current token, plus the non-expert components (embedding table, routing matrices) that stay resident. That resident portion is 5.5GB.
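The routing step can be sketched in a few lines of Python. This is a hypothetical illustration of top-k expert selection – not Flash-MoE's actual C/Metal implementation – with toy dimensions in place of the real model's:

```python
import math
import random

def route_token(hidden_state, router_weights, k=4):
    """Score every expert for this token and keep only the top-k.

    hidden_state: activation vector for the current token
    router_weights: one row per expert (the routing matrix)
    Returns the indices of the k experts whose weights must be
    available (streamed from SSD if not already resident), plus
    the mixing weights for combining their outputs.
    """
    # One dot product per expert gives that expert's routing logit.
    logits = [sum(w * h for w, h in zip(row, hidden_state))
              for row in router_weights]
    # Indices of the k highest-scoring experts.
    topk = sorted(range(len(logits)), key=lambda i: logits[i],
                  reverse=True)[:k]
    # Softmax over only the selected logits -> mixing weights.
    m = max(logits[i] for i in topk)
    exps = [math.exp(logits[i] - m) for i in topk]
    total = sum(exps)
    return topk, [e / total for e in exps]

# Toy sizes: 128 experts, 64-dimensional hidden state.
random.seed(0)
hidden = [random.gauss(0, 1) for _ in range(64)]
router = [[random.gauss(0, 1) for _ in range(64)] for _ in range(128)]
idx, mix = route_token(hidden, router, k=4)
```

Everything outside `topk` is dead weight for this token, which is why only the selected experts plus the always-resident components need to occupy RAM.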
The MoE sparsity is the first enabling condition. But sparsity alone doesn’t explain the speed. Streaming weights from SSD on a standard machine would be unusably slow. The second condition is what makes it fast enough to use.
Apple Silicon’s memory architecture
On a standard PC with a discrete GPU, weight streaming from SSD follows this path: NVMe -> CPU RAM -> PCIe bus -> VRAM. Each transfer crosses multiple buses, and PCIe bandwidth is the bottleneck – typically 16-32 GB/s peak, with significant latency overhead.
Apple Silicon doesn’t have that path. The CPU, GPU, Neural Engine, and SSD controller all share a unified memory fabric. There is no PCIe bus between the GPU and storage. The Metal compute pipeline can DMA weights directly from the NVMe into the memory region used by the GPU, without CPU involvement and without a bus crossing.
The M3 MacBook Pro’s NVMe runs at roughly 17 GB/s, and that bandwidth is the effective ceiling on weight-streaming throughput. That leads to a hardware comparison that matters.
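If streaming really is the bottleneck, throughput should scale roughly linearly with NVMe bandwidth. A quick sanity check using only the figures quoted in this post (a crude model that ignores expert caching, compute time, and read amplification):

```python
# Rough model: tokens/sec ~= NVMe bandwidth / bytes streamed per token.
m3_bandwidth_gbs = 17.0   # ~17 GB/s NVMe (from the text)
m3_tokens_per_sec = 5.5   # observed throughput (from the text)

# Implied streamed weight volume per token:
gb_per_token = m3_bandwidth_gbs / m3_tokens_per_sec
print(f"~{gb_per_token:.1f} GB streamed per token")

# Scaling that to the M1 Max's ~7-8 GB/s NVMe:
for bw in (7.0, 8.0):
    print(f"{bw} GB/s -> ~{bw / gb_per_token:.1f} tokens/sec")
```

The scaled estimate lands in the ~2-3 tokens/second range, which is where the table below puts the M1 Max.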
Hardware comparison: NVMe bandwidth is the bottleneck here
| Hardware | NVMe Bandwidth | Flash-MoE Speed |
|---|---|---|
| M3 MacBook Pro 48GB | ~17 GB/s | ~5.5 tokens/sec |
| M1 Ultra | higher | ~20 tokens/sec |
| M1 Max 64GB | ~7-8 GB/s | ~2-3 tokens/sec (estimated) |
| PC with discrete GPU | N/A (wrong architecture) | largely doesn’t work |
The M1 Max 64GB is a good machine for running dense models up to 48GB at Q4 – there, RAM is the bottleneck and 64GB wins. But for Flash-MoE specifically, despite its extra RAM, the M1 Max's slower NVMe (~7-8 GB/s) would cap throughput well below what the M3 MacBook Pro achieves. For this workload, SSD bandwidth matters more than RAM capacity.
The M1 Ultra at ~20 tokens/second crosses the threshold for comfortable interactive use. The MacBook Pro M3 at 5.5 tokens/second is usable for batch and async workloads – overnight document processing, research pipelines, tasks where you queue work and come back to results.
Apple’s 2023 paper made this possible
The third enabling condition is a research paper: LLM in a Flash: Efficient Large Language Model Inference with Limited Memory, published by Apple researchers in 2023. The paper specifically addresses running models larger than available DRAM by streaming parameters from flash memory on demand. It describes optimisations around contiguous chunk reads and minimising transfer volume that are directly applicable to Apple Silicon’s storage architecture.
Flash-MoE is that paper’s thesis built into production code. Dan Woods used Claude Code and a variant of Andrej Karpathy’s autoresearch pattern – feeding the paper to an LLM and running automated experiments – to produce Metal compute kernels implementing the streaming approach. The result is pure C and Metal, Apple Silicon only.
The three conditions together: MoE sparsity reduces how much you need at any moment; Apple Silicon’s unified memory architecture makes SSD-to-GPU transfer fast; the 2023 paper provided the algorithmic framework for doing it efficiently. Remove any one of them and the technique doesn’t work.
What the benchmarks mean
On M1 Ultra, the model scores: MMLU 87.86%, GPQA Diamond 82.32%, GSM8K 86.43%, IFEval 75.90%.
The GPQA Diamond number is the one worth pausing on. GPQA Diamond tests graduate-level reasoning in physics, chemistry, and biology – it’s specifically designed to be hard enough that domain experts struggle with it. At 82.32%, this model is competitive with frontier models. Claude Opus 4.6 scores 87.4% on the same benchmark; Sonnet 4.6 scores 74.1%.
Running on hardware that costs around $2,000 (48GB MacBook Pro M3). With 5.5GB of active RAM. On a consumer laptop.
The caveat the HN thread raises is real: the 2-bit quantization used in the original version showed quality degradation in longer sessions and initially broke tool calling. The 4-bit version (209GB on disk) resolves the tool calling issue, and community testing on similar-BPW quants shows that quality at this quantization level is viable for real workloads – with the usual caveat that error compounds over very long sessions.
What this changes about hardware selection
The conventional advice for local inference has been: buy as much RAM as you can. An M1 Max 64GB beats an M1 Max 32GB because you can fit larger models.
That advice still holds for dense models. If you’re running Llama 3 70B at Q4 or similar, RAM is your constraint and more is better.
For Flash-MoE specifically, the calculus inverts. You want fast NVMe, not maximum RAM. The M3 MacBook Pro with 48GB outperforms the M1 Max 64GB on this workload because its storage is faster, not because of RAM.
More broadly: the relevant constraint shifts depending on the workload. Dense models are RAM-bound. Flash-MoE is SSD-bandwidth-bound. Buying hardware based on a single axis – “more RAM is always better” – will miss the real bottleneck for workloads like this one.
The practical threshold today
At 5.5 tokens/second, Flash-MoE on an M3 MacBook Pro is viable for batch workloads where latency per token doesn’t matter: document analysis, research pipelines, batch classification, anything you queue and don’t watch. It’s not comfortable for interactive chat – most people notice when responses arrive slower than roughly 10 tokens/second.
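The arithmetic for queued work is simple. The document sizes below are illustrative assumptions, not figures from any benchmark:

```python
# How long a generation job takes at a given tokens/second rate.
def batch_minutes(output_tokens: int, tokens_per_sec: float) -> float:
    """Wall-clock minutes to generate output_tokens at tokens_per_sec."""
    return output_tokens / tokens_per_sec / 60

# A 2,000-token document summary at the M3 MacBook Pro's ~5.5 tok/s:
print(f"{batch_minutes(2_000, 5.5):.0f} min")
# A 100-document overnight pipeline at the same rate:
print(f"{batch_minutes(100 * 2_000, 5.5) / 60:.1f} h")
```

Roughly six minutes per document, or around ten hours for a hundred of them – fine for a job you queue before bed, painful for a chat window.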
For interactive use, you need M1 Ultra or better to clear that threshold.
For standard models up to 48GB at Q4 – Llama 3 70B, Qwen 32B, similar – an M1 Max 64GB remains the right call. RAM is the bottleneck there, and Flash-MoE’s SSD streaming doesn’t apply to dense models in the same way.
The interesting direction is what happens as this technique matures. The Anemll fork is moving independently. The quantization approaches are improving. The M3 Ultra, when it appears, will have both high NVMe bandwidth and maximum RAM. The combination could push interactive speeds on models at this capability level into genuinely useful territory.
The fundamental trade-off – 5.5GB active RAM, frontier-competitive quality, $2,000 hardware – is already real. The speed is the remaining constraint, and speed follows from silicon iteration.
Code: danveloper/flash-moe and Anemll/flash-moe. Research basis: LLM in a Flash (Apple, 2023). Simon Willison’s writeup: simonwillison.net.