Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs
Julian Ma, Jun Wang, Zafeirios Fountas

TL;DR
This paper investigates whether large language models exhibit human-like Bayesian cue integration, introducing a new benchmark and metrics to evaluate their probabilistic reasoning and emergent principles in multimodal tasks.
Contribution
It introduces BayesBench, a psychophysics-inspired benchmark for evaluating Bayesian behaviour in LLMs, and develops a Bayesian Consistency Score to detect principled uncertainty handling.
Findings
Some LLMs show Bayesian-consistent behaviour in cue integration.
Accuracy does not necessarily imply robust Bayesian reasoning.
GPT-5 Mini achieves high accuracy but struggles with visual cue integration.
Abstract
Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance,…
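For readers unfamiliar with the term, "optimal cue combination" in the psychophysics literature usually means reliability-weighted averaging of independent Gaussian cues. The sketch below illustrates that textbook rule only; it is not taken from the paper, and the function name `combine_cues` and the example numbers are hypothetical.

```python
def combine_cues(mu_a, sigma_a, mu_b, sigma_b):
    """Fuse two independent Gaussian cues by inverse-variance weighting.

    Assumption: each cue is an unbiased, Gaussian-noise estimate of the
    same magnitude (e.g. one from text, one from an image).
    """
    precision_a = 1.0 / sigma_a**2
    precision_b = 1.0 / sigma_b**2
    total_precision = precision_a + precision_b

    # Each cue is weighted by its share of the total precision.
    w_a = precision_a / total_precision
    w_b = precision_b / total_precision

    fused_mean = w_a * mu_a + w_b * mu_b
    # The fused estimate is never noisier than the more reliable cue.
    fused_sigma = (1.0 / total_precision) ** 0.5
    return fused_mean, fused_sigma


# Hypothetical example: a reliable cue (sigma = 1) and a noisier cue (sigma = 2).
fused_mean, fused_sigma = combine_cues(10.0, 1.0, 14.0, 2.0)
print(fused_mean, fused_sigma)
```

Under these assumptions the fused estimate sits closer to the more reliable cue, and its standard deviation is smaller than that of either cue alone. That reduction in variance is the usual behavioural signature checked for in cue-combination studies.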
Peer Reviews
Decision·Submitted to ICLR 2026
The paper makes several notable contributions. First, the authors develop a rigorous pipeline enabling controlled ablation of noise, context, and instruction prompts, which provides systematic leverage for probing LLM computational strategies. Second, the choice of perceptual tasks is well-motivated, as these tasks are less susceptible to contamination from previously acquired statistical rules or memorized patterns compared to knowledge-based tasks. Third, the evaluation framework is comprehens
**1. Confounding between capability and decision strategy.** The relationship between overall RMSE and Bayes factor evidence (Figure 7, line 376) raises fundamental questions about what is actually being measured. Since overall RMSE is partly influenced by Bayesian inference itself, the observed negative correlation may simply reflect circular reasoning rather than revealing genuine Bayesian tendencies specific to particular LLMs. A critical issue in establishing valid tests of Bayesian inferenc
The idea of testing LLMs in a range of commonly used psychophysical and behavioral tasks is interesting. The study tested four tasks on a number of large language models. Three of them were multi-modal, so it was possible to assess whether optimal Bayesian cue combination strategies were used in LLMs for these tasks. The authors consider several models of the observer’s behavior, namely a linear observer, a static Bayesian observer, and a Kalman filter. The authors developed several metrics to evalua
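The observer models named in this review can be sketched in a few lines. The code below is a generic one-dimensional illustration of a static Bayesian observer and a simple Kalman-filter observer, assuming Gaussian noise throughout. It is not the paper's implementation, and all function and parameter names are hypothetical.

```python
def static_bayes_estimate(measurement, sigma_m, prior_mean, sigma_p):
    """Posterior mean for a fixed Gaussian prior and a Gaussian likelihood.

    The gain k lies in (0, 1), so estimates are pulled toward prior_mean,
    which produces the familiar regression-to-the-mean pattern.
    """
    k = sigma_p**2 / (sigma_p**2 + sigma_m**2)
    return prior_mean + k * (measurement - prior_mean)


def kalman_estimates(measurements, sigma_m, init_mean, init_var, process_var):
    """Sequential estimates from a 1-D Kalman filter.

    Unlike the static observer, the belief is updated after every trial,
    so the effective prior drifts with the stimulus history.
    """
    mean, var = init_mean, init_var
    estimates = []
    for m in measurements:
        var_pred = var + process_var           # predict: uncertainty grows
        k = var_pred / (var_pred + sigma_m**2)  # Kalman gain
        mean = mean + k * (m - mean)            # update toward the measurement
        var = (1.0 - k) * var_pred
        estimates.append(mean)
    return estimates


# Hypothetical example values, chosen only for illustration.
print(static_bayes_estimate(20.0, 2.0, 10.0, 2.0))
print(kalman_estimates([10.0, 12.0], 1.0, 0.0, 1.0, 0.0))
```

A linear observer would replace the gain `k` with a free slope and intercept fitted to the data. Because it can reproduce the same central-tendency bias as the static Bayesian observer, distinguishing the two requires manipulating noise or context rather than fitting a single condition.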
- While several tasks were used to test several LLM variants, the study does not yield a major insight.
- It was not clear how to interpret the finding that some LLMs were able to better integrate the information from the two modalities. Does this have something to do with how these models were trained (differently)?
- The writing needs improvement. In various places of the paper, the interpretations of the prior literature were not accurate. I would like to suggest a careful
The paper's primary strength lies in its originality and interdisciplinary approach. Applying the rigorous, time-tested paradigm of psychophysics to probe the implicit computational strategies of LLMs is a highly novel and insightful direction, moving beyond standard accuracy-based evaluations. The methodological contribution, BayesBench, is solid. It provides a controllable and reproducible framework for testing how models handle uncertainty. The use of controlled ablations (noise, context, st
Despite the novel premise, the paper's central claim—that LLMs exhibit "emergent Bayesian behaviour"—is not adequately supported, as it rests on several questionable assumptions and interpretations. The most significant weakness is the conflation of "regression-to-the-mean" with Bayesian inference. The primary evidence for Bayesian processing is the regression effect shown in Figure 1, where estimates are biased toward the center of the stimulus range. While this pattern is consistent with Baye
Taxonomy
Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling
