LLM
- Benchmark or Breakthrough: GPT-5.4 and the Ramsey Hypergraph Question
GPT-5.4 Pro improved a constant on a Ramsey bound in Epoch's FrontierMath benchmark. Here is what that result actually means, and why the benchmark-or-breakthrough question resists a clean answer.
- iPhone 17 Pro Runs a 400B Model Locally. Here's What That Actually Means.
The iPhone 17 Pro has been demonstrated running a 400B parameter model locally via storage-as-RAM paging at 0.6 tokens per second. That speed makes it useless for production work today -- but the architectural threshold it crosses matters.
- Are LLMs Finally Reliable Enough for Production? The Hallucination Rate Story
Hallucination rates have dropped dramatically in narrow tasks like summarisation and code generation, but the picture is genuinely mixed -- some benchmarks show improvement while others reveal that more capable models can hallucinate more. Here is what the data actually shows, and which deployment decisions it should change.
- The AI Model Landscape: A Practical Guide for Engineering Teams
The model landscape has shifted again: Qwen 3 replaces Qwen 2.5 as the self-hosting recommendation, Llama 4 Scout and Maverick are now options for local inference, and the Mac Studio cluster story has changed the team-scale economics calculation.
- Why LLMs Need Bayesian Reasoning (and How Google Is Teaching It)
Google Research published a paper showing LLMs can be trained to reason like Bayesians -- updating beliefs as evidence arrives rather than pattern-matching to a confident answer. For engineers running production systems, this matters more than most benchmark improvements.
- The Agentic Evolution: From LLMs to Coding Agents to Whatever Comes Next
Most engineers have already crossed the first threshold, from LLMs to coding agents, without fully realising it. The next threshold -- autonomous agents -- is closer than they think, and the skills it demands are different again.