Reliability
- Are LLMs Finally Reliable Enough for Production? The Hallucination Rate Story
Hallucination rates have dropped dramatically on narrow tasks like summarisation and code generation, but the picture is genuinely mixed -- some benchmarks show steady improvement while others reveal that more capable models can hallucinate more, not less. Here is what the data actually shows, and which deployment decisions it should change.
- Formal Specs in the LLM Era: The Validation Layer AI-Generated Code Is Missing
LLMs are good at generating code. They are bad at knowing whether it's correct. Informal Systems used an executable specification language called Quint to add a mechanically verifiable validation layer -- and collapsed a months-long refactor into a week.
- Amazon's Kiro Took Down AWS for 13 Hours. The Fix Reveals a Bigger Problem.
In December 2025, Amazon's internal AI coding agent Kiro caused a 13-hour AWS outage while fixing a minor bug. The real story isn't the outage -- it's what Amazon's internal memo and subsequent response reveal about how AI-assisted changes are (and aren't) being governed in production.
- Why LLMs Need Bayesian Reasoning (and How Google Is Teaching It)
Google Research published a paper showing LLMs can be trained to reason like Bayesians -- updating beliefs as evidence arrives rather than pattern-matching to a confident answer. For engineers running production systems, this matters more than most benchmark improvements.
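The core idea -- revising a belief as each new observation arrives rather than committing to one confident answer -- is just Bayes' rule applied sequentially. A minimal sketch of that updating loop (a generic coin-bias example for illustration; this is not Google's training method):

```python
# Sequential Bayesian updating: posterior ∝ likelihood × prior.
# Hypotheses: a coin's heads-probability is 0.3, 0.5, or 0.8,
# initially considered equally likely.
priors = {0.3: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def update(beliefs, heads):
    """Revise beliefs after observing a single coin flip."""
    posterior = {
        p: belief * (p if heads else 1 - p)  # multiply by likelihood of the flip
        for p, belief in beliefs.items()
    }
    total = sum(posterior.values())
    return {p: w / total for p, w in posterior.items()}  # renormalise

beliefs = priors
for flip in [True, True, False, True, True]:  # evidence arrives one flip at a time
    beliefs = update(beliefs, flip)

best = max(beliefs, key=beliefs.get)  # most probable hypothesis after 4 heads, 1 tail
```

After four heads in five flips, the 0.8-bias hypothesis dominates -- but the other hypotheses retain nonzero weight, which is exactly the calibrated-uncertainty behaviour the paper argues LLMs should exhibit.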