Commissioned, Curated and Published by Russ. Researched and written with AI.
Someone ran a 400B parameter model on an iPhone 17 Pro. Not via API. Not offloaded to a server somewhere. Locally, on the device. It generated at 0.6 tokens per second – roughly one word every two seconds. That speed is completely unusable. It still matters.
What Actually Happened
The demo comes from an open-source project called Flash-MoE, shown running on an iPhone 17 Pro by @anemll on Twitter. The model is a Mixture of Experts architecture with around 397 billion parameters total. Even in quantized form, a model at this scale requires a minimum of 200GB of memory to run – the iPhone 17 Pro ships with 12GB of LPDDR5X RAM.
The trick that makes it work: Flash-MoE doesn’t load the model into RAM. It streams weights directly from the device’s SSD to the GPU on demand. The OS filesystem cache handles what stays hot and what gets evicted. The MoE architecture compounds the advantage – because only a subset of the model’s experts activate per token, the amount of weight that needs to be in-flight at any moment is far smaller than the total parameter count.
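The paging idea can be sketched in a few lines. This is a toy illustration, not Flash-MoE's actual implementation (which streams to the GPU via Metal): each expert's weights live in their own file, the files are memory-mapped so the OS page cache decides what stays resident, and only the experts the router picks ever get read. The file layout and the hash-based "router" here are hypothetical stand-ins.

```python
import numpy as np
import tempfile, os

def make_expert_file(path, d_model=64, d_ff=128):
    # Write one expert's weight matrix to disk.
    w = np.random.randn(d_model, d_ff).astype(np.float16)
    w.tofile(path)
    return w.shape

def load_expert_mmapped(path, shape):
    # np.memmap reads no data until a page is actually touched,
    # so only the experts the router selects ever hit RAM.
    return np.memmap(path, dtype=np.float16, mode="r", shape=shape)

def moe_forward(x, expert_paths, shape, top_k=2):
    # Toy router: pick top_k experts by a hash of the input.
    # A real MoE router is a learned gating network.
    chosen = sorted(range(len(expert_paths)),
                    key=lambda i: hash((i, x.tobytes())))[:top_k]
    out = np.zeros(shape[1], dtype=np.float32)
    for i in chosen:
        w = load_expert_mmapped(expert_paths[i], shape)
        out += x.astype(np.float32) @ np.asarray(w, dtype=np.float32)
    return out / top_k

tmp = tempfile.mkdtemp()
paths = [os.path.join(tmp, f"expert_{i}.bin") for i in range(8)]
for p in paths:
    shape = make_expert_file(p)

x = np.random.randn(shape[0]).astype(np.float16)
y = moe_forward(x, paths, shape)
print(y.shape)  # only 2 of the 8 expert files were ever opened
```

The key property: RAM usage scales with the experts activated per token, not the total parameter count, and eviction is the kernel's problem rather than the application's.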
The result is a device with 12GB of RAM running inference against a model that, loaded conventionally, would need over 16x that memory. The SSD is doing the work RAM can’t.
The Threshold That Got Crossed
The useful framing here isn’t “this phone is fast enough for AI.” It isn’t. The framing is: a consumer device with no external connectivity just ran a model in the same parameter class as the frontier systems that defined the last two years of the industry.
Until very recently, 400B+ parameter models were infrastructure stories. You needed racks of H100s, or at minimum a system with hundreds of gigabytes of unified memory. The assumption baked into almost every on-device AI architecture decision – that frontier-scale parameters lived in the cloud – was structural. This demo doesn’t break that assumption in practice today, but it demonstrates the assumption isn’t physically necessary.
That’s a different kind of signal than incremental benchmark improvement.
The Honest Caveat
0.6 tokens per second is not a caveat. It's a disqualifier for any real use case. Conversational inference needs somewhere in the range of 20 to 50 tokens per second to feel responsive. Code generation, document summarisation, anything latency-sensitive – all require 30 to 80 times the throughput this demo achieves, closer to two orders of magnitude than one.
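The arithmetic behind that gap, as a back-of-envelope check (the 1.3 tokens-per-word figure is a common English average, assumed here):

```python
# How far 0.6 tokens/s is from feeling responsive.
demo_tps = 0.6
usable_tps = (20, 50)      # rough range for conversational use
tokens_per_word = 1.3      # assumed average for English text

seconds_per_word = tokens_per_word / demo_tps
gap_low = usable_tps[0] / demo_tps
gap_high = usable_tps[1] / demo_tps

print(f"{seconds_per_word:.1f} s per word")            # ~2.2 s per word
print(f"{gap_low:.0f}x to {gap_high:.0f}x short")      # 33x to 83x short
```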
The HN thread makes the other constraint concrete: thermal. Running even mid-sized models on a MacBook pushes the fans hard after twenty minutes. One commenter noted that after running the benchmark in winter conditions, the battery got cold enough that the phone wouldn't boot – the battery chemistry gave out before the chip did. A 400B model sustained over a longer session would almost certainly run into thermal throttling before completing anything useful.
There’s also a quality caveat the parameter count obscures. MoE models with this many total parameters don’t necessarily behave like a dense model of equivalent size. The active parameter count per token is much smaller. Whether the quality holds up in practice depends on the specific model and routing – you can’t read “400B” and assume frontier quality.
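To see the scale of that gap, here is an illustrative calculation. The source doesn't give this model's expert configuration, so every number below (expert count, routing width, shared-parameter fraction) is hypothetical, chosen only to show how far active parameters can fall below the headline total:

```python
# Hypothetical MoE config: NOT this model's real numbers.
total_params = 397e9
n_experts = 128            # assumed number of routed experts
experts_per_token = 8      # assumed top-k routing width
shared_fraction = 0.05     # assumed always-active share (attention, embeddings)

shared = total_params * shared_fraction
routed = total_params - shared
active = shared + routed * experts_per_token / n_experts
print(f"active per token: {active/1e9:.0f}B of {total_params/1e9:.0f}B")
```

Under these assumptions each token touches roughly a tenth of the headline parameter count – which is exactly why "400B" on the box tells you little about dense-model-equivalent quality.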
What Changes If This Trajectory Continues
The SSD-streaming approach works here because iPhone storage is fast – NVMe-class speeds, designed for low-latency random access. As storage bandwidth improves across generations and quantization techniques get tighter, the throughput gap narrows. If the pattern from desktop hardware holds – where Apple Silicon went from running 7B models comfortably to 70B models usably in roughly three years – the same curve applied to phones points toward genuinely usable larger models within this decade.
What that unlocks is privacy-absolute inference. No network call, no API log, no telemetry. The model runs entirely in your hand, against data that never leaves the device. For certain categories of use case – medical, legal, personal – that changes the calculus significantly. Today’s cloud inference providers ask you to trust their data handling. Local inference removes the question entirely.
What doesn't change: the hard relationship between parameter count, memory bandwidth, and latency. Per-token latency is floored by the bytes of weights that must move divided by the bandwidth available to move them. You can route around memory constraints with clever paging, but you can't route around physics. Getting to 20 t/s on a 400B model on a phone isn't an optimisation problem. It's a materials science problem. The SSD has to get faster, or the quantization has to get more aggressive without losing quality, or both.
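That floor can be estimated directly. All inputs here are assumptions for illustration – active parameters per token, quantization width, phone NVMe read speed, and page-cache hit rate are plausible guesses, not measured values:

```python
# Bandwidth-bound ceiling on tokens/s for SSD-streamed inference.
active_params = 40e9       # assumed active params per token
bits_per_weight = 4        # aggressive quantization, assumed
nvme_bandwidth = 3e9       # bytes/s, plausible phone NVMe read speed
cache_hit = 0.8            # assumed fraction served from the RAM page cache

bytes_per_token = active_params * bits_per_weight / 8   # 20 GB of weights
bytes_from_ssd = bytes_per_token * (1 - cache_hit)      # 4 GB actually streamed
tps_ceiling = nvme_bandwidth / bytes_from_ssd

target_tps = 20
required_bw = target_tps * bytes_from_ssd
print(f"ceiling ~{tps_ceiling:.2f} t/s; "
      f"{target_tps} t/s needs ~{required_bw/1e9:.0f} GB/s from storage")
```

Under these assumptions the ceiling lands near the demo's observed 0.6 t/s, and reaching 20 t/s demands tens of gigabytes per second of sustained storage bandwidth – hence the claim that this is a hardware problem, not a software one.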
The Part Worth Remembering
The iPhone 17 Pro running a 400B model at 0.6 t/s isn’t a product announcement. It’s a proof that the architectural assumption was wrong. The models themselves don’t require server infrastructure – the speed requirement does. Those are separable. When the speed requirement gets met, the infrastructure assumption collapses with it.
That’s the thing to watch.