Emergent Bayesian Behaviour and Optimal Cue Combination in LLMs

Julian Ma; Jun Wang; Zafeirios Fountas

arXiv:2512.02719 · cs.CL · December 3, 2025

Open Access · 3 Reviews

TL;DR

This paper investigates whether large language models exhibit human-like Bayesian cue integration, introducing a new benchmark and metrics to evaluate their probabilistic reasoning and emergent principles in multimodal tasks.

Contribution

It introduces BayesBench, a psychophysics-inspired benchmark for evaluating Bayesian behaviour in LLMs, and develops a Bayesian Consistency Score to detect principled uncertainty handling.

Findings

01. Some LLMs show Bayesian-consistent behaviour in cue integration.

02. Accuracy does not necessarily imply robust Bayesian reasoning.

03. GPT-5 Mini achieves high accuracy but struggles with visual cue integration.

Abstract

Large language models (LLMs) excel at explicit reasoning, but their implicit computational strategies remain underexplored. Decades of psychophysics research show that humans intuitively process and integrate noisy signals using near-optimal Bayesian strategies in perceptual tasks. We ask whether LLMs exhibit similar behaviour and perform optimal multimodal integration without explicit training or instruction. Adopting the psychophysics paradigm, we infer computational principles of LLMs from systematic behavioural studies. We introduce a behavioural benchmark - BayesBench: four magnitude estimation tasks (length, location, distance, and duration) over text and image, inspired by classic psychophysics, and evaluate a diverse set of nine LLMs alongside human judgments for calibration. Through controlled ablations of noise, context, and instruction prompts, we measure performance,…
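
As a point of reference for what "optimal cue combination" usually means in the psychophysics literature, the following is a minimal sketch of the standard maximum-likelihood rule for fusing two independent Gaussian cues, in which each cue is weighted by its inverse variance. The function name `combine_cues` and its parameters are illustrative assumptions and are not taken from the paper or from BayesBench.

```python
import math


def combine_cues(mu_a, sigma_a, mu_b, sigma_b):
    """Fuse two independent Gaussian cues by inverse-variance weighting.

    Hypothetical helper for illustration only; not the paper's code.
    Returns the fused mean and the fused standard deviation.
    """
    # Precision (reliability) of each cue is the inverse of its variance.
    prec_a = 1.0 / sigma_a ** 2
    prec_b = 1.0 / sigma_b ** 2

    # Each cue's weight is its share of the total precision.
    w_a = prec_a / (prec_a + prec_b)
    w_b = 1.0 - w_a

    fused_mu = w_a * mu_a + w_b * mu_b
    # Precisions add, so the fused variance is the inverse of their sum.
    fused_sigma = math.sqrt(1.0 / (prec_a + prec_b))
    return fused_mu, fused_sigma


# Example: a text cue (sd 1.0) and a noisier image cue (sd 2.0).
mu, sigma = combine_cues(10.0, 1.0, 14.0, 2.0)
print(mu, sigma)
```

With equally reliable cues the fused estimate is the plain average. As one cue gets noisier its weight shrinks, and the fused standard deviation is never larger than that of the more reliable cue.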

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 4

Strengths

The paper makes several notable contributions. First, the authors develop a rigorous pipeline enabling controlled ablation of noise, context, and instruction prompts, which provides systematic leverage for probing LLM computational strategies. Second, the choice of perceptual tasks is well-motivated, as these tasks are less susceptible to contamination from previously acquired statistical rules or memorized patterns compared to knowledge-based tasks. Third, the evaluation framework is comprehens

Weaknesses

**1. Confounding between capability and decision strategy.** The relationship between overall RMSE and Bayes factor evidence (Figure 7, line 376) raises fundamental questions about what is actually being measured. Since overall RMSE is partly influenced by Bayesian inference itself, the observed negative correlation may simply reflect circular reasoning rather than revealing genuine Bayesian tendencies specific to particular LLMs. A critical issue in establishing valid tests of Bayesian inferenc

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The idea of testing LLMs in a range of commonly used psychophysical and behavioral tasks is interesting. The study tested four tasks on a number of large language models. Three of them were multi-modal, so it was possible to assess whether optimal Bayesian cue combination strategies were used in LLMs for these tasks. The authors consider several models of the observer’s behavior, namely a linear observer, a static Bayesian observer, and a Kalman filter. The authors developed several metrics to evalua

Weaknesses

- While several tasks were used to test several variations of LLMs, there is not a major insight learned from the study.
- It was not clear how to interpret the finding that some LLMs were able to better integrate the information from the two modalities. Does this have something to do with how these models were trained (differently)?
- The writing needs improvement. In various places of the paper, the interpretations of the prior literature were not accurate. I would like to suggest a careful

Reviewer 03 · Rating 4 · Confidence 4

Strengths

The paper's primary strength lies in its originality and interdisciplinary approach. Applying the rigorous, time-tested paradigm of psychophysics to probe the implicit computational strategies of LLMs is a highly novel and insightful direction, moving beyond standard accuracy-based evaluations. The methodological contribution, BayesBench, is solid. It provides a controllable and reproducible framework for testing how models handle uncertainty. The use of controlled ablations (noise, context, st

Weaknesses

Despite the novel premise, the paper's central claim—that LLMs exhibit "emergent Bayesian behaviour"—is not adequately supported, as it rests on several questionable assumptions and interpretations. The most significant weakness is the conflation of "regression-to-the-mean" with Bayesian inference. The primary evidence for Bayesian processing is the regression effect shown in Figure 1, where estimates are biased toward the center of the stimulus range. While this pattern is consistent with Baye
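
The regression effect this reviewer refers to can be illustrated with a minimal sketch of a static observer that combines a noisy measurement with a Gaussian prior: the posterior mean is pulled toward the prior mean, and more strongly so when the measurement is noisier. The name `bayes_estimate`, the Gaussian assumptions, and the example numbers are hypothetical and are not the paper's model.

```python
def bayes_estimate(measurement, prior_mean, prior_sd, noise_sd):
    """Posterior mean for a Gaussian prior and Gaussian measurement noise.

    Hypothetical illustration only; not taken from the paper.
    """
    prior_var = prior_sd ** 2
    noise_var = noise_sd ** 2
    # Weight on the measurement: 1 when noise is zero, smaller as noise grows.
    w = prior_var / (prior_var + noise_var)
    return w * measurement + (1.0 - w) * prior_mean


# Stimuli below the prior mean are overestimated, those above are
# underestimated, which produces the central-tendency pattern.
for stimulus in (2.0, 5.0, 8.0):
    print(stimulus, bayes_estimate(stimulus, prior_mean=5.0, prior_sd=2.0, noise_sd=2.0))
```

In this sketch the estimate is a linear function of the measurement with slope `w` below 1, so the shrinkage toward the centre of the range grows as `noise_sd` increases relative to `prior_sd`.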

Taxonomy

Topics: Multimodal Machine Learning Applications · Explainable Artificial Intelligence (XAI) · Topic Modeling