LLM
- Benchmark or Breakthrough: GPT-5.4 and the Ramsey Hypergraph Question
GPT-5.4 Pro improved a constant on a Ramsey bound in Epoch's FrontierMath benchmark. Here is what that result actually means, and why the benchmark-or-breakthrough question resists a clean answer.
- iPhone 17 Pro Runs a 400B Model Locally. Here's What That Actually Means.
The iPhone 17 Pro has been demonstrated running a 400B parameter model locally via storage-as-RAM paging at 0.6 tokens per second. That speed makes it useless for production work today -- but the architectural threshold it crosses matters.
- Are LLMs Finally Reliable Enough for Production? The Hallucination Rate Story
Hallucination rates have dropped dramatically in narrow tasks like summarisation and code generation, but the picture is genuinely mixed -- some benchmarks show improvement while others reveal that more capable models can hallucinate more. Here is what the data actually shows, and which deployment decisions it should change.
- The AI Model Landscape: A Practical Guide for Engineering Teams
The model landscape has shifted again: Qwen 3 replaces Qwen 2.5 as the self-hosting recommendation, Llama 4 Scout and Maverick are now options for local inference, and the Mac Studio cluster story has changed the team-scale economics calculation.
- Why LLMs Need Bayesian Reasoning (and How Google Is Teaching It)
Google Research published a paper showing LLMs can be trained to reason like Bayesians -- updating beliefs as evidence arrives rather than pattern-matching to a confident answer. For engineers running production systems, this matters more than most benchmark improvements.
- The Agentic Evolution: From LLMs to Coding Agents to Whatever Comes Next
Most engineers have already crossed the first threshold, from LLMs to coding agents, without fully realising it. The next threshold -- autonomous agents -- is closer than they think, and the skills it demands are different again.