Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge
Drago Plecko, Patrik Okanovic, Shreyas Havaldar, Torsten Hoefler, Elias Bareinboim

TL;DR
This paper introduces a benchmark to evaluate whether large language models understand real-world probability distributions, revealing their limited ability to internalize such statistical knowledge across various domains.
Contribution
The paper develops the first benchmark for assessing LLMs' knowledge of empirical distributions, highlighting their poor performance and limited understanding of real-world statistics.
Findings
LLMs perform poorly on distributional knowledge tasks
They do not naturally internalize real-world statistical distributions
Limited observational distribution knowledge implies constraints on causal reasoning
Abstract
Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions.…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
To my knowledge, the task investigated by this work is novel. It is certainly a useful task, and asking whether LLMs can satisfactorily answer this type of question is scientifically and practically useful. The benchmark and methodology are reasonable. The paper is mostly clear and well written, and cites relevant related work.
While the task addressed is relevant, it is not clear that these types of questions are the most relevant for end users (say, data analysts or domain experts). The text gives no justification for the particular choice of statistical questions analysed. For example, one might be more interested in determining the mode of a distribution or other statistics (median, interquartile range, etc.) than in the marginal distribution of a categorical variable. The discussion about SCMs and causality are not very
The motivation is overall solid: observational knowledge is a prerequisite for meaningful causal claims, and the benchmark operationalizes that premise in a way that is modular and easy to extend. The scoring method emphasizes stability and avoids KL’s brittleness; the use of permutation averaging usefully mitigates option-ordering artifacts. The paper triangulates results across open and closed models, contrasts question-answer with likelihood prompting, and even tests retrieval and finetuning;
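The permutation averaging mentioned in this review can be illustrated with a minimal sketch. This is not the authors' implementation: the function `permutation_averaged_probs`, the stand-in model `toy_elicit`, its position-bias factor, and all numbers are hypothetical, chosen only to show how averaging over answer orderings dampens an option-ordering artifact.

```python
from itertools import permutations
from typing import Callable, Dict, List, Sequence


def permutation_averaged_probs(
    options: Sequence[str],
    elicit: Callable[[Sequence[str]], List[float]],
) -> Dict[str, float]:
    """Average elicited probabilities over every ordering of the options.

    `elicit(order)` is assumed to return one probability per position in
    `order`; each value is credited back to the option shown in that slot.
    """
    totals = {opt: 0.0 for opt in options}
    n_orders = 0
    for order in permutations(options):
        probs = elicit(order)
        for opt, p in zip(order, probs):
            totals[opt] += p
        n_orders += 1
    return {opt: total / n_orders for opt, total in totals.items()}


def toy_elicit(order: Sequence[str]) -> List[float]:
    """Hypothetical stand-in for a model that favours the first-listed option."""
    base = {"female": 0.2, "male": 0.8}  # made-up numbers, not from the paper
    weights = [base[opt] * (1.5 if i == 0 else 1.0) for i, opt in enumerate(order)]
    z = sum(weights)
    return [w / z for w in weights]


averaged = permutation_averaged_probs(["female", "male"], toy_elicit)
print(averaged)
```

In this toy setup the probability assigned to an option depends on whether it is listed first, and the averaged estimate falls between the two order-specific values. Enumerating all orderings grows factorially with the number of options, so a real evaluation would presumably sample a subset of permutations.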
- The conceptual contrast between “factual” and “probabilistic” knowledge is possibly overstated. Many distributional summaries are themselves "facts" about the world expressed in text, so the paper could temper claims that distributional knowledge is categorically distinct from factual knowledge and instead clarify that the difference lies in aggregation and calibration rather than a deeper ontological difference.
- The paper occasionally implies that models are claimed to “approximate real-wo
1. Focuses on observational distributions, not factual QA. Ties the goal to Pearl’s hierarchy with a precise statement of implications.
2. Ten public US datasets across health, education, labor, finance, crime, and attitudes. Tasks span 1d and higher conditional distributions.
3. Shows consistently poor performance across many open and closed models, and separates low vs high dimensional regimes.
1. Main analysis leans on multiple choice next token elicitation. Small wording or answer labeling changes can shift probabilities. The paper mentions a probability prompt variant only in the appendix.
2. High-dimensional results cover four datasets with a modest number of contexts, so evidence about the curse of dimensionality remains suggestive.
3. Some low-dimensional tasks may appear on public sites. The paper notes better scores where tables are likely online, but does not audit data leak
The research motivation behind this paper is interesting. Testing LLMs' abilities in describing real-world population distributions is a good evaluation angle.
**Presentation:** The presentation could be significantly improved. Sec. 2.1 and 2.2 contain excessive background details that do not directly contribute to the paper’s core message. Meanwhile, the main experimental analysis takes only about one page, which is insufficient to convey meaningful insights. The authors should better balance specificity and abstraction—focusing on the most critical aspects in the main text and moving only secondary details to the appendix. **Soundness:** The evalua
Taxonomy
Topics: Machine Learning in Healthcare · Probability and Statistical Research · Data-Driven Disease Surveillance
