Hypothesis generation and updating in large language models
Hua-Dong Xiong

TL;DR
This paper studies how large language models generate and update hypotheses in a controlled number game. Comparing their inferences with an optimal Bayesian model and with human behavior reveals systematic biases and limitations.
Contribution
The paper analyzes hypothesis inference in LLMs in detail, identifies biases such as a strong-sampling bias and a gap between hypothesis evaluation and hypothesis generation, and highlights the limitations of LLMs in scientific reasoning.
Findings
LLMs are often well described by a Bayesian model with systematic biases.
They favor narrower hypotheses due to a strong-sampling bias.
Models generalize poorly beyond observed data.
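For readers unfamiliar with the term, strong sampling is usually formalized as below. This is the textbook formulation from the number-game literature, not notation taken from this paper: the symbols D (the n observed numbers), h (a hypothesis, treated as a set of numbers), and |h| (its size) are assumptions made for illustration.

```latex
% Strong sampling: examples are drawn uniformly from the hypothesis itself
P_{\text{strong}}(D \mid h) =
\begin{cases}
  \left( \dfrac{1}{|h|} \right)^{n} & \text{if } D \subseteq h, \\
  0 & \text{otherwise.}
\end{cases}

% Weak sampling: examples are only required to be consistent with the hypothesis
P_{\text{weak}}(D \mid h) =
\begin{cases}
  1 & \text{if } D \subseteq h, \\
  0 & \text{otherwise.}
\end{cases}
```

Under strong sampling, a smaller |h| gives a larger likelihood for the same consistent data, and the advantage grows exponentially with n. This is why a strong-sampling assumption pushes a learner toward narrower hypotheses.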
Abstract
Large language models (LLMs) increasingly help people solve problems, from debugging code to repairing machinery. This process requires generating plausible hypotheses from partial descriptions, then updating them as more information arrives. Yet how LLMs perform this form of inference, and how close it is to optimal, remains unclear. We study this question in the number game, a controlled setting in which a learner infers the hypothesis supported by a few positive integers, for example a rule like powers of 2 or an interval like numbers near 20. We measure the posterior over hypotheses using three complementary probes: posterior prediction, hypothesis evaluation, and hypothesis generation. We then compare LLM behavior with an optimal Bayesian model and human behavior, and test whether the same posterior is expressed across probes. LLMs are often well described by a…
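As a rough illustration of the setting the abstract describes, the following is a minimal sketch of a Bayesian number-game learner. It is not the paper's model: the four-hypothesis space, the uniform prior, the 1 to 100 domain, and the function names `posterior` and `predict` are all hypothetical choices made here for illustration. `posterior` corresponds loosely to the hypothesis evaluation probe and `predict` to the posterior prediction probe.

```python
from fractions import Fraction

# Hypothetical toy hypothesis space over 1..100 (an assumption, not the paper's).
DOMAIN = range(1, 101)
HYPOTHESES = {
    "powers of 2": {n for n in DOMAIN if n & (n - 1) == 0},
    "even numbers": {n for n in DOMAIN if n % 2 == 0},
    "numbers 10-30": set(range(10, 31)),
    "all numbers": set(DOMAIN),
}


def posterior(data, strong=True):
    """Posterior over HYPOTHESES given observed numbers, with a uniform prior.

    strong=True uses the strong-sampling likelihood (1/|h|)^n;
    strong=False uses weak sampling (likelihood 1 for any consistent h).
    """
    scores = {}
    for name, extension in HYPOTHESES.items():
        if not set(data) <= extension:
            scores[name] = Fraction(0)  # hypothesis is inconsistent with the data
        elif strong:
            scores[name] = Fraction(1, len(extension)) ** len(data)
        else:
            scores[name] = Fraction(1)
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}


def predict(data, query, strong=True):
    """Posterior-predictive probability that `query` belongs to the concept."""
    post = posterior(data, strong)
    return sum(p for name, p in post.items() if query in HYPOTHESES[name])


if __name__ == "__main__":
    observed = [16, 8, 2, 64]  # illustrative observations, chosen for this sketch
    for name, p in posterior(observed).items():
        print(f"{name:>14}: {float(p):.6f}")
    print("P(32 in concept):", float(predict(observed, 32)))
    print("P(10 in concept):", float(predict(observed, 10)))
```

With the illustrative observations `[16, 8, 2, 64]`, strong sampling places nearly all posterior mass on "powers of 2", while weak sampling splits the mass evenly among the three consistent hypotheses. Exact fractions are used so the comparison is not blurred by floating-point rounding.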
