Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New

First publication. No prior version.


Changelog

22 Mar 2026 – Initial publication covering the flash-moe project and its implications for local inference.

On March 18-19, 2026, Dan Woods – VP of AI Platforms at CVS Health – published flash-moe: a pure C/Metal inference engine that runs Qwen3.5-397B-A17B on a MacBook Pro M3 Max with 48GB unified memory at 5-5.7 tokens per second.

The model weighs 209GB. The machine has 48GB of RAM. It works anyway.

The reason it works tells you something important about where local inference is actually headed.

The Memory Constraint Was Always About the Wrong Number

The assumption that’s governed local LLM hardware decisions for years: a model’s parameter count dictates memory requirements. A 70B model at 4-bit needs roughly 35-40GB. A 397B model needs something north of 200GB. End of story.

Flash-MoE breaks that assumption, but it only breaks it for one category of model: Mixture-of-Experts.

Qwen3.5-397B-A17B has 397 billion total parameters, but only 17 billion are active for any given token. The architecture maintains 512 experts per layer. Normally, 10 are activated per token. The “397B” is the full parameter space. The “17B” is what actually runs.

This is not a quirk of Qwen specifically – it’s the fundamental property of MoE architecture. Sparse activation means the model is mostly idle at any given moment. Most of those 397 billion parameters are never touched during a single forward pass.

Flash-MoE exploits this directly: only the active experts per layer are loaded into memory. Each expert is approximately 6.75MB. You load the ones you need, generate a token, move on. The 209GB model streams from SSD on demand. RAM usage during inference is around 5.5GB – not 48GB, not 209GB. 5.5GB.

This is why it only works for MoE. Dense models like Llama activate every parameter for every token. There’s nothing to stream selectively. Flash-MoE’s trick requires the sparsity that MoE provides.

The K=4 Boundary

The more interesting finding is what happened when Dan pruned expert activation.

The default K value for Qwen3.5-397B is 10 – meaning 10 experts per layer are activated per token. Dan discovered you can reduce that to K=4 with no measurable quality degradation. K=3 causes immediate collapse.

In his own words, from his post on X:

“Qwen 3.5 397B has 512 experts per layer but only activates 10 per token, and we found you can prune that down to 4 with no quality degradation (K=3 causes immediate quality collapse, which suggests the routing learned to distribute critical reasoning across specific experts)”

This isn’t just an engineering optimisation. It’s a research finding about how expert routing actually works.

The sharp boundary at K=3/K=4 suggests the model’s routing isn’t evenly distributing load across experts. Certain experts are carrying disproportionate weight for coherent reasoning. Go below the threshold and you start dropping those. The quality cliff is immediate, not gradual – which tells you the critical experts aren’t redundant with each other.

The practical consequence: K=4 instead of K=10 cuts the per-token memory access from 10 experts per layer to 4. At roughly 6.75MB per expert, that's about 27MB per layer instead of 67.5MB. Combined with SSD streaming and OS page caching, this is what makes 5-5.7 tokens per second achievable.

Trust the OS

Flash-MoE has no custom memory management. No cache eviction strategy. No prefetch logic. The SSD streaming is managed entirely by the OS page cache.

This is a deliberate choice, and it’s the right one. The OS page cache on macOS has been tuned for exactly this kind of workload – repeated access to large files with spatial locality. Re-implementing that logic in application code would add significant complexity and almost certainly perform worse.

Apple’s unified memory architecture matters here too. There’s no PCIe bottleneck between CPU memory and GPU memory. The M-series SSD is fast by PC standards. These things combine to make streaming from SSD viable in a way that wouldn’t work on a typical x86 workstation with a discrete GPU and a PCIe-connected NVMe drive.

This isn’t saying Apple hardware is magic. It’s saying the specific combination of fast SSD, unified memory, and a decent page cache is what the technique requires. Matching that on other architectures is possible but not trivially so.

What This Means for Hardware Decisions

If you’re evaluating local inference hardware – which I covered in more detail in the hardware comparison – flash-moe shifts the calculus for MoE models specifically.

The M3 Max 48GB demo machine is a relatively constrained starting point for this. An M1 Max 64GB Mac Studio or M2 Ultra with more unified memory gives you more headroom for keeping hot experts in RAM, which should reduce SSD access frequency for repeated patterns. More memory means better page cache hit rates.

The implication: for MoE inference workloads, a Mac Studio or M2/M3 Ultra is more interesting than ever. Not because the model fits in RAM – it doesn’t – but because more RAM means better streaming performance and more resident experts at any given time.

This also connects to the broader local inference moment – the pattern of capability expanding faster than expected, driven by architectural insights rather than raw hardware improvements.

Honest Limits

Flash-MoE is not Ollama. Running it requires cloning the repo, compiling C code, and having a working Metal development environment. If you’re not comfortable with that workflow, this isn’t for you yet.

5-5.7 tokens per second is borderline for interactive use. For conversational response latency, you want 15 tokens per second or above. Flash-MoE at current speeds is more useful for batch tasks, agent pipelines, or background processing where response time doesn’t need to feel immediate.

Qwen3.5-397B is a significantly more capable model than Qwen3-72B, but the speed penalty is severe. Whether the quality difference is worth running at 5 t/s depends heavily on what you’re doing with it.

And again – this technique is MoE-specific. It works because sparse activation means most of the model is idle per token. Dense models (Llama, Mistral, most things without an explicit MoE architecture) don’t benefit. The technique doesn’t generalise.

How Dan Built It

Worth noting the process, not just the output. Dan used Andrej Karpathy’s autoresearch framework for structured experimentation, Apple’s “LLM in a Flash” research paper as the architectural blueprint, and Claude Code as the orchestration layer for the research process. The result is pure C and Metal – no Python, no ML frameworks, direct GPU compute.

A VP of AI Platforms at a major US healthcare company ran this as a personal project and published it open source. That says something about where the tooling is – AI-assisted research workflows are now capable enough that one person can implement and validate a research paper’s ideas in a compressed timeframe. The output is a working system, not a proof of concept.

The project is at github.com/danveloper/flash-moe, with a fork at github.com/Anemll/flash-moe. Simon Willison covered it at simonwillison.net the same day it dropped.

The assumption that local inference of very large models requires equally large memory is wrong – at least for MoE. The constraint was always active parameters, not total ones. That’s the thing worth holding onto.