Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New This Week

Initial publication. The Vectara hallucination leaderboard updated on March 20, 2026, showing the current best models hitting sub-4% on summarisation tasks – a useful baseline for where the field actually stands heading into mid-2026.


Changelog

Date          Summary
23 Mar 2026   Initial publication covering hallucination benchmarks through March 2026.

The default assumption in most engineering teams is still: LLMs hallucinate too much to trust with anything important. That assumption made sense in 2024. It is starting to need revision – but only in specific directions, and the nuance matters.

This is not an “AI is fixed now” story. It is a more useful one: for which tasks has reliability improved enough to change your deployment calculus?

What “hallucination” actually means, and why it matters which kind

The word hallucination covers at least three meaningfully different failure modes, each with different production consequences.

Confident fabrication is the dangerous one. The model asserts something false with high certainty – invents a court case, cites a nonexistent paper, states a wrong drug dosage. This is the failure mode that ends careers and causes legal liability. Research from MIT (January 2025) found that models are 34% more likely to use phrases like “definitely” and “certainly” when generating incorrect information. The more wrong the model is, the more certain it sounds. That inversion is the core problem.

Minor factual inaccuracy – slightly wrong dates, imprecise numbers, paraphrasing that loses nuance – is a different category. It matters a lot in some contexts (compliance documents, medical summaries) and barely at all in others (generating test data, drafting a first-pass email). The production consequence depends entirely on whether a human is reviewing the output.

Format and instruction errors – the model misunderstands the task, produces the wrong structure, or fails to follow a system prompt – are largely solvable through engineering. Better prompting, output validation, and retry logic handle most of these. They are annoying but not dangerous.
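The validate-and-retry pattern for format errors can be sketched in a few lines. This is a hypothetical illustration rather than any particular SDK: `call_model` and `validate` stand in for your own client and schema check.

```python
import json

def generate_with_retry(call_model, prompt, validate, max_attempts=3):
    """Handle format/instruction errors: call the model, validate the
    output, and retry with an error hint appended on failure.
    `call_model` and `validate` are placeholders for your own client
    and schema check (hypothetical, not a specific library)."""
    last_error = None
    for attempt in range(max_attempts):
        full_prompt = prompt if attempt == 0 else (
            f"{prompt}\n\nPrevious output was invalid ({last_error}). "
            "Return valid JSON only."
        )
        raw = call_model(full_prompt)
        try:
            parsed = json.loads(raw)
            validate(parsed)  # expected to raise ValueError on schema mismatch
            return parsed
        except (json.JSONDecodeError, ValueError) as e:
            last_error = str(e)
    raise RuntimeError(f"Invalid output after {max_attempts} attempts: {last_error}")
```

The key design choice is feeding the validation error back into the retry prompt: models fix format mistakes far more reliably when told what was wrong than when simply asked again.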

Most published hallucination statistics do not distinguish between these. That is a problem. A 5% hallucination rate on a summarisation benchmark is a very different thing from a 5% rate on medical diagnosis queries. Check what the benchmark is actually measuring before drawing conclusions.

What the benchmark data actually shows

The Vectara hallucination leaderboard – which specifically measures how often models introduce false information when summarising documents – was updated on March 20, 2026. The top performers on that narrow task are now quite good: GPT-5.4-nano at 3.1%, Gemini-2.5-flash-lite at 3.3%, and several others in the 4-6% range. For context, Claude Sonnet 4 (the May 2025 version) sits at 10.3% on the same benchmark, and GPT-4o (August 2024 version) at 9.6%.

That is a genuine improvement over where these models were 18 months ago. Sub-5% on summarisation was exceptional then. It is now attainable from multiple providers.

On code generation, the SWE-bench Verified trajectory is more dramatic. Claude 3.7 Sonnet, released in February 2025, scored 62.3% on the benchmark. Current top models – Claude Opus 4.5 leads at around 80% – have gained nearly 18 percentage points in roughly 12 months. These numbers measure whether an agent can correctly resolve a real GitHub issue end-to-end. That is not a hallucination benchmark directly, but it is a meaningful proxy for agentic reliability: the model has to understand a codebase, reason about what is broken, and produce a correct fix.

On difficult open-ended knowledge questions, the picture is much worse. According to analysis from Suprmind’s research report (citing benchmark data), all but three of 40 tested models are more likely to hallucinate than give a correct answer on harder knowledge tasks. The leaderboard that shows 3% hallucination on summarisation is measuring a constrained task with a reference document. Remove the document and ask about obscure facts, and performance falls off sharply.

Where improvement is real vs where it is benchmark gaming

The summarisation improvement is probably real. The task is well-defined, the evaluation is automated and consistent, and multiple independent benchmarks show similar trends. When a model summarises a document it has been given, hallucination is now a manageable engineering problem for many applications.

The code generation improvement is also real, with caveats. SWE-bench Verified has faced criticism for potentially allowing overfitting – models trained on GitHub data could theoretically encounter similar issues during training. Scale Labs introduced SWE-bench Pro to test this, using harder, less-seen problems. On that benchmark, top models including GPT-5 and Claude Opus 4.1 score only around 23% – a significant drop from the 70-80% range on the standard benchmark. The improvement is real, but it is not as large as the headline numbers suggest when you test on genuinely novel problems.

The most important counterintuitive finding from the past year: reasoning models can be worse on hallucination, not better. OpenAI’s own system card for o3 noted that the model makes more claims overall – which means more correct claims but also more incorrect ones. Separately, Glean’s enterprise AI research found that o3 and o4-mini exhibit hallucination rates ranging from 33% to 79% in some enterprise contexts, more than double the rates seen in older o1 models. A model that thinks longer and more verbosely is not necessarily more accurate. It is sometimes just more confidently wrong.

Domain-specific hallucination rates remain high across all models. According to Suprmind’s benchmark compilation, even the best current models show hallucination rates around 18.7% on legal questions and 15.6% on medical queries. Those numbers have not fallen at the same rate as summarisation benchmarks. The improvement is concentrated in tasks where there is a reference document to ground the model.

Use cases where reliability has changed enough to matter

Code generation and review is the clearest win. The combination of improved SWE-bench scores, better tool use, and the feedback loop available from running tests means that coding agents can now ship working code at rates that were not possible 18 months ago. Importantly, hallucinations in code are detectable – the code either runs or it does not. Simon Willison made this point in March 2025, writing that hallucinations in code are the least dangerous form of LLM mistakes, because verification is cheap. A wrong import statement fails loudly. A wrong drug interaction does not.
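Willison's point – that verification is cheap for code – is worth making concrete. A minimal sketch of that cheap verification, assuming generated Python: check that the output compiles, then run it in a subprocess so hallucinated imports and APIs fail loudly.

```python
import subprocess
import sys

def check_generated_code(code: str) -> tuple[bool, str]:
    """Cheap verification for model-generated Python: does it compile,
    and does it run cleanly (including its own assertions) in a
    subprocess? A hallucinated import or nonexistent API fails loudly
    here, unlike a hallucinated fact in prose."""
    try:
        compile(code, "<generated>", "exec")  # syntax check, no execution
    except SyntaxError as e:
        return False, f"syntax error: {e}"
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=10,
    )
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, "ok"
```

Running untrusted generated code warrants a sandbox in production; the subprocess here only illustrates the feedback loop, not the isolation.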

Document summarisation with a provided source is now reliable enough for production in most non-critical contexts. Sub-5% hallucination rates on grounded summarisation, combined with consistent behaviour across runs and document lengths, mean this use case has crossed a threshold. Enterprise document processing, meeting summaries, ticket summarisation – these are deployable with appropriate review for high-stakes outputs.

Structured output generation – producing JSON, filling templates, extracting specific fields from a given document – has improved significantly. When the model is constrained to a schema and working from provided text, the failure modes are different from open-ended generation. Reliability here is good enough that many teams are running this in production without per-output human review.
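The schema constraint that makes this use case tractable can be enforced with a very small check. A minimal sketch, assuming a hypothetical invoice-extraction task: a real system would use JSON Schema or a typed model library, but the shape of the check is the same.

```python
def validate_extraction(output: dict, schema: dict) -> list[str]:
    """Minimal schema check for model-extracted fields: every expected
    field must be present with the right type. `schema` maps field name
    to expected Python type. Field names here are hypothetical; swap in
    JSON Schema or a typed model for anything real."""
    errors = []
    for field, expected_type in schema.items():
        if field not in output:
            errors.append(f"missing field: {field}")
        elif not isinstance(output[field], expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(output[field]).__name__}"
            )
    return errors
```

Returning a list of errors rather than raising on the first one matters in practice: the full error list is what you feed back to the model on retry.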

Factual Q&A without a source document has not improved enough to change most deployment decisions. If you are asking a model questions it has to answer from training data alone, without RAG or tool use, the error rates remain high enough to be disqualifying for anything where an incorrect answer causes harm.

Where hallucination risk is still disqualifying

Medical, legal, and financial advice remain off-limits for autonomous LLM decision-making, and the benchmark data supports this conclusion rather than challenging it. Hallucination rates of 15-18% on domain-specific queries represent confident, fluent, wrong answers that a non-expert reader would have no way to detect. A lawyer presenting hallucinated case citations, a compliance tool generating an incorrect regulatory interpretation, a diagnostic support tool that misidentifies a drug interaction – these are active risks with the current generation of models.

This is not a “one more generation of models” problem that will resolve itself. These models are trained to generate plausible text, not to retrieve verified facts. Adding RAG helps significantly, but it introduces a different set of failure modes around retrieval quality and context handling. The failure mode shifts; it does not disappear.
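One of the new failure modes RAG introduces is the ungrounded claim: the model answers fluently from training data instead of the retrieved context. A crude grounding check can flag this; the sketch below uses naive word overlap as a stand-in for what production systems do with an NLI model or an LLM judge.

```python
def unsupported_sentences(answer: str, context: str, threshold: float = 0.5):
    """Naive grounding check: flag answer sentences whose content words
    mostly do not appear in the retrieved context. Word overlap is a
    crude proxy only; real systems use an NLI model or an LLM judge."""
    context_words = set(context.lower().split())
    flagged = []
    for sentence in answer.split("."):
        words = [w.strip(",;:").lower() for w in sentence.split() if len(w) > 3]
        if not words:
            continue
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap < threshold:
            flagged.append(sentence.strip())
    return flagged
```

Even this crude check illustrates the shift in failure mode the paragraph above describes: the question stops being "is the model right?" and becomes "is the answer actually supported by what was retrieved?"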

Any context where a confident wrong answer causes downstream harm that cannot be caught before it causes damage belongs in this category. The test is not “how often is the model right” – it is “what happens when it is wrong, and will I find out before it matters?”

What this means for engineering teams right now

The deployment calculus has genuinely shifted for specific use cases. If your team decided against LLM integration 18 months ago because hallucination rates were too high for document summarisation or code assistance, that decision is worth revisiting. The models have improved meaningfully on those tasks.

What has not changed: the fundamental need to match the failure mode of the task to the cost of a wrong answer. The engineering approach that has become standard – use RAG for factual queries, constrain output schemas wherever possible, run automated validation on model outputs, keep humans in the loop for high-stakes decisions – is still right. Better models do not replace those patterns. They make them work better.

The most useful re-evaluation right now is probably around structured output and code review tooling. Teams that built heavy validation layers 18 months ago to compensate for format errors and minor inaccuracies may find those layers are now more friction than value. The underlying capability has moved faster than most engineering assumptions about it.

What should make you nervous: the finding that reasoning models can hallucinate more than their predecessors on certain tasks. If you have deployed o3 or similar in a high-volume factual Q&A context based on the assumption that “better reasoning equals fewer hallucinations,” it is worth running your own evaluation. The assumption may not hold in your specific context.
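Running your own evaluation does not require a framework. A minimal sketch of an own-data harness: run the model over question/answer pairs from your domain and count unsupported answers. Exact substring matching here is a placeholder for whatever grading your task actually needs (an LLM judge, numeric tolerance, expert review).

```python
def hallucination_rate(model_fn, dataset):
    """Minimal own-data evaluation: run `model_fn` over (question,
    acceptable_answers) pairs and count answers matching none of the
    accepted strings. Substring matching is a stand-in for real
    grading (LLM judge, numeric tolerance, expert review)."""
    wrong = 0
    for question, acceptable in dataset:
        answer = model_fn(question).strip().lower()
        if not any(a.lower() in answer for a in acceptable):
            wrong += 1
    return wrong / len(dataset)
```

A few hundred pairs drawn from your actual traffic will tell you more about your deployment than any public leaderboard, precisely because the benchmark gaps described above are task-specific.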

The question is not “has hallucination been solved?” It has not. The question is: for this specific task, with this specific failure mode, at this specific error rate – does the risk calculus now come out differently than it did 18 months ago? For a meaningful subset of production use cases, the answer is yes.