Reliability
- Are LLMs Finally Reliable Enough for Production? The Hallucination Rate Story
Hallucination rates have dropped dramatically on narrow tasks like summarisation and code generation, but the picture is genuinely mixed -- some benchmarks show steady improvement while others reveal that more capable models can hallucinate more, not less. Here is what the data actually shows, and which deployment decisions it should change.
- Formal Specs in the LLM Era: The Validation Layer AI-Generated Code Is Missing
LLMs are good at generating code. They are bad at knowing whether it's correct. Informal Systems used an executable specification language called Quint to add a mechanically verifiable validation layer -- and collapsed a months-long refactor into a week.
- Amazon's Kiro Took Down AWS for 13 Hours. The Fix Reveals a Bigger Problem.
In December 2025, Amazon's internal AI coding agent Kiro caused a 13-hour AWS outage while fixing a minor bug. The real story isn't the outage -- it's what Amazon's internal memo and subsequent response reveal about how AI-assisted changes are (and aren't) being governed in production.
- Why LLMs Need Bayesian Reasoning (and How Google Is Teaching It)
Google Research published a paper showing LLMs can be trained to reason like Bayesians -- updating beliefs as evidence arrives rather than pattern-matching to a confident answer. For engineers running production systems, this matters more than most benchmark improvements.
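The core idea -- revising a belief as each new observation arrives rather than committing to one confident answer -- is just Bayes' rule applied sequentially. A minimal sketch of that updating loop (a generic coin-bias example for illustration; this is not Google's training method):

```python
# Sequential Bayesian updating: posterior ∝ likelihood × prior.
# Hypotheses: a coin's heads-probability is 0.3, 0.5, or 0.8,
# initially considered equally likely.
priors = {0.3: 1 / 3, 0.5: 1 / 3, 0.8: 1 / 3}

def update(beliefs, heads):
    """Revise beliefs after observing a single coin flip."""
    posterior = {
        p: belief * (p if heads else 1 - p)  # multiply by likelihood of the flip
        for p, belief in beliefs.items()
    }
    total = sum(posterior.values())
    return {p: w / total for p, w in posterior.items()}  # renormalise

beliefs = priors
for flip in [True, True, False, True, True]:  # evidence arrives one flip at a time
    beliefs = update(beliefs, flip)

best = max(beliefs, key=beliefs.get)  # most probable hypothesis after 4 heads, 1 tail
```

After four heads in five flips, the 0.8-bias hypothesis dominates -- but the other hypotheses retain nonzero weight, which is exactly the calibrated-uncertainty behaviour the paper argues LLMs should exhibit.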