Commissioned, Curated and Published by Russ. Researched and written with AI.


What’s New

The Nature Communications paper “Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models” (arXiv:2503.17523) has been making the rounds since Google Research published a companion blog post on 9 March 2026. The research, by Sjoerd van Steenkiste and Tal Linzen, demonstrates that supervised fine-tuning on demonstrations of correct Bayesian reasoning – rather than demonstrations of correct answers – substantially improves LLMs’ ability to update beliefs as new evidence arrives, and that this ability generalises to domains unseen during training.


Changelog

Date | Summary
9 Mar 2026 | Initial publication.

You ask a model to recommend a flight. It gives you a confident answer. You say the timing doesn’t work for you. It gives you another confident answer. You say cost is the main constraint. Another confident answer. Each response sounds authoritative. None of them reflect what the model has actually learned about your preferences from the conversation so far.

That’s not a knowledge problem. The model has access to everything it needs. It’s a calibration problem: the model has no mechanism for maintaining uncertainty about what you want, updating that uncertainty as evidence arrives, and letting accumulated evidence drive progressively better recommendations. It’s pattern-matching each prompt to training data, not running a belief model.

Google Research just published a paper showing this can be fixed – and the fix generalises.

What Bayesian reasoning actually is

Brief version, for readers who know their statistics: Bayesian inference is the mathematically optimal way to update beliefs. You start with a prior, you observe evidence, you compute a posterior. The posterior becomes the new prior for the next observation. Repeat until you’re confident enough to act.

The key properties are representation and update. A Bayesian reasoner maintains a probability distribution over possible states of the world, not a point estimate. It doesn’t just say “the user wants cheap flights” – it says “there’s a 60% chance they prioritise cost, 30% chance they prioritise timing, 10% something else.” As evidence arrives, those numbers shift. By round five of a conversation, the distribution has converged on something accurate.
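That update loop fits in a few lines. Here is a minimal sketch using the 60/30/10 prior from above; the likelihood numbers (how probable the observed choice of the cheapest flight is under each hypothesis) are invented for illustration:

```python
def bayes_update(prior, likelihood):
    """Apply Bayes' rule over a discrete set of hypotheses: posterior ∝ prior × likelihood."""
    unnormalised = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalised.values())
    return {h: p / total for h, p in unnormalised.items()}

# Prior over what the user prioritises.
prior = {"cost": 0.60, "timing": 0.30, "other": 0.10}

# Evidence: the user just picked the cheapest flight.
# P(that choice | hypothesis) – illustrative numbers, not from the paper.
likelihood = {"cost": 0.80, "timing": 0.30, "other": 0.33}

posterior = bayes_update(prior, likelihood)
# The posterior becomes the prior for the next observation.
```

One observation consistent with the cost hypothesis pushes its probability up (here from 0.60 to roughly 0.80), and the loop repeats with each new piece of evidence.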

Standard LLMs do neither. They don’t maintain distributions. They produce the most plausible next token given the prompt. Those are fundamentally different operations.

Why LLMs don’t do this by default

Pre-training optimises for plausibility. The loss function rewards producing text that looks like the training distribution – human-written text, which tends to be confident. Uncertainty is stylistic in human writing (“I think”, “probably”, “I’m not sure”). It doesn’t correspond to a genuine probability estimate. So models learn to produce confident-sounding text because that’s what the training distribution contains.

RLHF compounds the problem. Human raters tend to prefer responses that sound definitive over responses that hedge. You train the model to be more confident because confidence scores better in human preference data.

The result: a model that produces output with the same surface confidence whether it’s 99% likely to be right or 30% likely to be right. The confidence signal is useless because it’s not calibrated to accuracy.

This is the actual mechanism behind hallucination. The naive mental model is “the model hallucinated because it didn’t know the answer.” The more accurate mental model: the model hallucinated because it had no mechanism to represent that it didn’t know the answer. It generated the most plausible continuation of the prompt, which happened to be wrong, with no way to flag that it was uncertain. Adding more training data to a system with this architecture produces a model with more facts and the same calibration problem.

What Google did

The paper, published in Nature Communications, tests a specific hypothesis: can you train Bayesian reasoning into an LLM by showing it demonstrations of correct Bayesian reasoning, rather than demonstrations of correct answers?

The experimental setup: a flight recommendation task. An LLM acts as an assistant across five rounds of interaction with a simulated user. Each round presents three flight options (varying by cost, duration, stops, departure time). The user selects their preferred option based on hidden preferences. The assistant receives feedback and must recommend better over time.

The benchmark is a “Bayesian Assistant” – a symbolic model that implements Bayes’ rule exactly. It maintains a probability distribution over the user’s preference profile and updates it correctly after every round. It’s the mathematical ideal.
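For concreteness, here is a toy version of such an assistant. The preference profiles, flight features, and softmax choice model are all invented assumptions for illustration – the paper’s exact specification may differ – but the structure is the same: hold a posterior over profiles, observe a choice, apply Bayes’ rule, repeat.

```python
import math

# Candidate preference profiles: (cost weight, duration weight).
PROFILES = [(0.8, 0.2), (0.5, 0.5), (0.2, 0.8)]

def utility(profile, flight):
    cost_w, dur_w = profile
    # Lower cost and duration are better, so utility is a negative weighted sum.
    return -(cost_w * flight["cost"] + dur_w * flight["duration"])

def likelihood(profile, options, chosen):
    # Softmax choice model: a user with this profile picks
    # high-utility flights more often, but not deterministically.
    exps = [math.exp(utility(profile, f)) for f in options]
    return math.exp(utility(profile, chosen)) / sum(exps)

def update(posterior, options, chosen):
    # Bayes' rule: posterior ∝ prior × likelihood of the observed choice.
    post = {p: posterior[p] * likelihood(p, options, chosen) for p in posterior}
    z = sum(post.values())
    return {p: v / z for p, v in post.items()}

posterior = {p: 1 / len(PROFILES) for p in PROFILES}  # uniform prior
hidden = (0.8, 0.2)  # simulated user: cares mostly about cost

options = [
    {"cost": 1.0, "duration": 0.0},
    {"cost": 0.0, "duration": 1.0},
    {"cost": 0.5, "duration": 0.5},
]
for _ in range(5):  # five rounds; same options each round for simplicity
    chosen = max(options, key=lambda f: utility(hidden, f))
    posterior = update(posterior, options, chosen)
# Round by round, the posterior concentrates on the user's hidden profile.
```

Nothing here is learned: it’s a few lines of arithmetic. That is what makes it a clean benchmark – and, later in the paper, a clean teacher.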

Off-the-shelf Gemma and Qwen models performed significantly worse than the Bayesian Assistant. More telling: their performance plateaued after a single interaction. They weren’t using subsequent evidence to improve. The Bayesian Assistant kept improving across all five rounds because it was genuinely updating. The LLMs weren’t.

Two fine-tuning strategies were tested:

Oracle teaching – train the LLM on interactions where a perfect oracle (with full knowledge of user preferences) always recommended the correct flight. The training signal is “here are the right answers.”

Bayesian teaching – train the LLM on interactions from the Bayesian Assistant. Crucially, the Bayesian Assistant doesn’t know the user’s preferences upfront, so its early recommendations are uncertain best-guesses that get progressively more accurate. The training signal is “here is what correct probabilistic reasoning looks like, including the uncertainty.”
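The difference between the two training signals can be shown with a deliberately tiny, hypothetical task of my own construction (the paper’s tasks are richer): a user secretly prefers either “cheap” or “fast” flights, the oracle knows which, and the Bayesian teacher only infers it from the choices observed so far.

```python
def oracle_targets(choices, hidden_pref):
    # Oracle teaching: every training target is the right answer from round 1.
    return [hidden_pref for _ in choices]

def bayesian_targets(choices, prior=0.5):
    # Bayesian teaching: targets reflect the teacher's evolving posterior
    # P(user prefers "cheap"), so early targets encode genuine uncertainty.
    p_cheap, targets = prior, []
    for choice in choices:
        targets.append("cheap" if p_cheap >= 0.5 else "fast")
        # Bayes update with an assumed noisy-choice likelihood of 0.8 vs 0.2.
        like_cheap = 0.8 if choice == "cheap" else 0.2
        like_fast = 0.2 if choice == "cheap" else 0.8
        p_cheap = (p_cheap * like_cheap
                   / (p_cheap * like_cheap + (1 - p_cheap) * like_fast))
    return targets

user_choices = ["fast", "fast", "fast"]      # hidden preference: "fast"
print(oracle_targets(user_choices, "fast"))  # right from the start
print(bayesian_targets(user_choices))        # first guess wrong, then converges
```

The oracle’s targets are correct everywhere; the Bayesian teacher’s first target is an honest best guess that turns out wrong, and the sequence sharpens as evidence accumulates. That trajectory – not just the final answers – is what Bayesian teaching puts into the training data.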

Both approaches improved performance over the baseline. Bayesian teaching was consistently better. Models fine-tuned with Bayesian teaching agreed with the optimal Bayesian Assistant’s predictions around 80% of the time. They kept improving across rounds rather than plateauing.

The critical finding: the improvement generalised. Models trained exclusively on the synthetic flight recommendation task transferred their Bayesian reasoning skills to hotel recommendations and real-world web shopping – domains unseen during fine-tuning. This is not memorisation. The models had internalised something about how to reason under uncertainty, not just the correct answers to a specific task.

The paper’s framing is precise about what’s happening here: this is distillation of a classic symbolic model into a neural network. You take a system that implements optimal probabilistic reasoning in a constrained domain, generate training data from it, and train the LLM to approximate it. The neural network then develops a general capability that the symbolic model couldn’t scale to (open-ended language, novel domains), because it’s learned the principle rather than memorised the implementation.

Why this matters for production systems

Engineers currently deal with the calibration problem through workarounds. RAG to ground responses in retrieved facts. Output validation to catch obvious errors. Chain-of-thought prompting to make reasoning visible. Manual review pipelines. These all address the symptoms. None of them address the underlying issue: the model doesn’t know what it doesn’t know.

If models genuinely develop calibrated uncertainty, the architecture of reliable AI systems changes.

Recommendation and personalisation systems. A model that maintains and updates a belief distribution over user preferences across a conversation will produce progressively better recommendations. The current approach is to pack everything into a single prompt – user history, stated preferences, context – and hope the model pattern-matches to something useful. A Bayesian-capable model could treat each user interaction as evidence, converging on accurate preference estimates over time. This maps naturally to agentic systems where the model is running repeated interactions rather than single-shot queries.

Agents that know when to ask. The problem with agentic LLMs is that they proceed with equal confidence regardless of whether they’re actually confident. An agent with calibrated uncertainty could flag when it’s operating outside its confidence threshold and ask for confirmation before taking a high-stakes action – rather than executing a destructive operation because it pattern-matched to a plausible plan. This is not solvable by prompting. It requires the model to have genuine uncertainty estimates.

RAG systems with better signal. Current RAG pipelines retrieve context and pass it to the model, but the model can’t reliably signal when the retrieved context is insufficient. A calibrated model could flag “I found relevant context but I’m not confident it covers this case” versus “I’m confident the retrieved context answers this.” That distinction would let you route uncertain cases to human review without treating all outputs as equally unreliable.
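The routing logic itself is trivial once a calibrated confidence signal exists; the hard part is the signal. A sketch, assuming the model can emit a confidence score alongside its answer (the threshold is arbitrary):

```python
def route(answer, confidence, threshold=0.85):
    """Send low-confidence outputs to human review instead of auto-serving them."""
    if confidence >= threshold:
        return ("auto", answer)
    return ("human_review", answer)
```

With uncalibrated models, a threshold like this is security theater, because the score doesn’t track accuracy. With calibrated models, it becomes a genuine dial between automation rate and error rate.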

Evaluation that goes beyond accuracy. If models express calibrated uncertainty, you can evaluate them on calibration metrics – Brier scores, reliability diagrams, expected calibration error. These measure whether the model’s stated confidence correlates with its actual accuracy. Right now, accuracy is the dominant evaluation signal because confidence is noise. Calibrated models open up evaluation approaches that are arguably more useful for production systems than raw benchmark scores. A model comparison on calibration would be meaningfully informative in a way that MMLU scores aren’t.
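Both metrics named above are a few lines to compute from scratch. Here `confidences` are the model’s stated probabilities of being correct and `outcomes` are 1 when it actually was; the example numbers are invented:

```python
def brier_score(confidences, outcomes):
    # Mean squared gap between stated confidence and actual correctness.
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    # Bin predictions by confidence, then compare each bin's average
    # confidence to its empirical accuracy, weighted by bin size.
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, o))
    n = len(outcomes)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(o for _, o in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece

confs = [0.9, 0.8, 0.7, 0.95, 0.6]
hits  = [1,   1,   0,   1,    1]
brier = brier_score(confs, hits)
ece = expected_calibration_error(confs, hits)
```

Both are zero for a perfectly calibrated, perfectly informative model; neither can be gamed by sounding confident, which is exactly why they’re useless on today’s uncalibrated outputs and useful the moment calibration is real.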

What this doesn’t fix

This is early research on a constrained problem. The flight recommendation task is simplified: discrete features, a small number of options, a clear ground truth. Scaling Bayesian teaching to the open-ended complexity of real-world LLM deployment is a harder problem that this paper hasn’t solved.

The paper tests Gemma and Qwen at specific scales in a specific fine-tuning regime. We don’t know how this interacts with instruction tuning, RLHF, or the various post-training steps in frontier models. We don’t know whether the generalisation holds at the complexity level required for general-purpose assistants.

The fine-tuning approach requires a symbolic Bayesian model to generate training data from. That works in constrained recommendation-style tasks where you can specify and implement the optimal Bayesian strategy mathematically. For open-ended tasks, there’s no clean Bayesian Assistant to distill from. The paper acknowledges this, noting that “in this controlled setting it’s easy to implement” – implying that extending it requires solving harder problems about how to construct Bayesian teaching data for less structured domains.

None of this is a reason to dismiss the finding. The generalisation result – that Bayesian reasoning transfers across domains after training on a single domain – is genuinely significant. It suggests that LLMs can learn reasoning principles, not just task-specific patterns. That’s the core insight, and it holds regardless of the implementation challenges for production-scale systems.

Calibration is an architectural property

The instinct when a model hallucinates is to add more training data. More facts, more documents, more examples of correct answers. This instinct is wrong, or at least incomplete.

Hallucination is a calibration failure. The model produced a confident-sounding answer in a case where it should have expressed uncertainty. More training data doesn’t fix that. A model with more facts and no calibration mechanism will hallucinate more confidently. It’ll be right more often, but when it’s wrong, it’ll be just as assertive about it.

Calibration requires the model to represent uncertainty as a first-class concept – to maintain distributions over possible answers rather than committing to point estimates. Current training paradigms don’t teach this. They reward plausibility. The output looks confident because confident-sounding text is what humans write, and the model is trained to approximate human text.

Google’s result suggests calibration can be taught by showing a model what correct probabilistic reasoning looks like, in practice, across many examples, in a domain where you can verify what correct probabilistic reasoning looks like. The model internalises the principle and applies it elsewhere.

That’s a qualitatively different kind of training signal than fine-tuning on correct answers. It’s showing the model how to reason under uncertainty, not just what the right answers are. And it generalises.

For engineers building systems where reliability matters more than benchmark scores, this is the research direction worth watching. Not because it’s solved – it isn’t – but because it’s attacking the right problem.


Sources: Google Research blog | Nature Communications paper | arXiv:2503.17523