Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

Drago Plecko; Patrik Okanovic; Shreyas Havaldar; Torsten Hoefler; Elias Bareinboim

arXiv:2511.03070·cs.AI·December 5, 2025

Open Access · 4 Reviews

TL;DR

This paper introduces a benchmark to evaluate whether large language models understand real-world probability distributions, revealing their limited ability to internalize such statistical knowledge across various domains.

Contribution

The paper develops the first benchmark for assessing LLMs' knowledge of empirical distributions, highlighting their poor performance and limited understanding of real-world statistics.

Findings

01. LLMs perform poorly on distributional knowledge tasks.

02. They do not naturally internalize real-world statistical distributions.

03. Limited observational distribution knowledge implies constraints on causal reasoning.

Abstract

Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions.…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

To my knowledge, the task investigated by this work is novel. It is certainly a useful task, and asking whether LLMs can satisfactorily answer this type of question is scientifically and practically useful. The benchmark and methodology are reasonable. The paper is mostly clear and well written, and cites relevant related work.

Weaknesses

While the task addressed is relevant, it is not clear that these types of questions are the most relevant for end users (say, data analysts or domain experts). The text gives no justification for the particular choice of statistical questions analysed. For example, one might be more interested in determining the mode of a distribution or other statistics (median, interquartile range, etc.) instead of a marginal distribution of a categorical variable. The discussion about SCMs and causality are not very

Reviewer 02 · Rating 4 · Confidence 3

Strengths

The motivation is overall solid: observational knowledge is a prerequisite for meaningful causal claims, and the benchmark operationalizes that premise in a way that is modular and easy to extend. The scoring method emphasizes stability and avoids KL’s brittleness; the use of permutation averaging usefully mitigates option-ordering artifacts. The paper triangulates results across open and closed models, contrasts question-answer with likelihood prompting, and even tests retrieval and finetuning;
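The two scoring ideas the reviewer mentions, permutation averaging and avoiding KL's brittleness, can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `biased_model` is a hypothetical stand-in for a model that favours whichever option is listed first, and total variation distance is used only as one example of a bounded score; the excerpt does not state which metric the paper actually uses.

```python
import itertools
import math


def permutation_average(score_fn, options):
    """Average elicited probabilities over every ordering of the options."""
    totals = {opt: 0.0 for opt in options}
    perms = list(itertools.permutations(options))
    for perm in perms:
        probs = score_fn(list(perm))  # probs[i] belongs to perm[i]
        for opt, p in zip(perm, probs):
            totals[opt] += p
    return {opt: total / len(perms) for opt, total in totals.items()}


def kl_divergence(p, q):
    """KL(p || q); infinite as soon as q gives zero mass where p does not."""
    total = 0.0
    for key, pk in p.items():
        if pk == 0.0:
            continue
        if q[key] == 0.0:
            return math.inf
        total += pk * math.log(pk / q[key])
    return total


def total_variation(p, q):
    """Total variation distance; always between 0 and 1."""
    return 0.5 * sum(abs(p[key] - q[key]) for key in p)


def biased_model(ordered_options):
    """Hypothetical model: 0.7 mass on the first-listed option, rest split evenly."""
    n = len(ordered_options)
    return [0.7] + [0.3 / (n - 1)] * (n - 1)


if __name__ == "__main__":
    # The position bias cancels once all orderings are averaged.
    print(permutation_average(biased_model, ["a", "b", "c"]))

    truth = {"a": 0.5, "b": 0.5}
    answer = {"a": 1.0, "b": 0.0}
    print(kl_divergence(truth, answer))    # unbounded when the model rules out "b"
    print(total_variation(truth, answer))  # stays bounded
```

Averaging over all orderings is only practical for a handful of options, since the number of permutations grows factorially; a random sample of orderings would be the natural substitute for longer option lists.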

Weaknesses

- The conceptual contrast between “factual” and “probabilistic” knowledge is possibly overstated. Many distributional summaries are themselves "facts" about the world expressed in text, so the paper could temper claims that distributional knowledge is categorically distinct from factual knowledge and instead clarify that the difference lies in aggregation and calibration rather than a deeper ontological difference.
- The paper occasionally implies that models are claimed to “approximate real-wo

Reviewer 03 · Rating 4 · Confidence 3

Strengths

1. Focuses on observational distributions, not factual QA. Ties the goal to Pearl’s hierarchy with a precise statement of implications.
2. Ten public US datasets across health, education, labor, finance, crime, and attitudes. Tasks span 1d and higher conditional distributions.
3. Shows consistently poor performance across many open and closed models, and separates low vs high dimensional regimes.

Weaknesses

1. Main analysis leans on multiple choice next token elicitation. Small wording or answer labeling changes can shift probabilities. The paper mentions a probability prompt variant only in the appendix.
2. High-dimensional results cover four datasets with a modest number of contexts, so evidence about the curse of dimensionality remains suggestive.
3. Some low-dimensional tasks may appear on public sites. The paper notes better scores where tables are likely online, but does not audit data leak
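The first weakness can be made concrete with a small sketch of multiple-choice next-token elicitation, under the assumption that the answer distribution is obtained by renormalising the model's scores for the answer-label tokens. The logit values below are invented for illustration and `option_distribution` is a hypothetical helper, not the paper's code.

```python
import math


def option_distribution(label_logits):
    """Softmax restricted to the answer-label tokens (e.g. "A", "B")."""
    peak = max(label_logits.values())  # subtract the max for numerical stability
    exps = {label: math.exp(value - peak) for label, value in label_logits.items()}
    norm = sum(exps.values())
    return {label: e / norm for label, e in exps.items()}


if __name__ == "__main__":
    # Invented logits: the same question under two slightly different wordings.
    wording_one = {"A": 1.0, "B": 1.0}
    wording_two = {"A": 1.2, "B": 1.0}
    print(option_distribution(wording_one))
    print(option_distribution(wording_two))
```

Because the elicited distribution depends only on differences between label logits, even a small shift caused by rewording or relabeling the options moves the reported probabilities, which is the sensitivity the reviewer is pointing at.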

Reviewer 04 · Rating 2 · Confidence 4

Strengths

The research motivation behind this paper is interesting. Testing LLMs' abilities in describing real-world population distributions is a good evaluation angle.

Weaknesses

**Presentation:** The presentation could be significantly improved. Sec. 2.1 and 2.2 contain excessive background details that do not directly contribute to the paper’s core message. Meanwhile, the main experimental analysis takes only about one page, which is insufficient to convey meaningful insights. The authors should better balance specificity and abstraction, focusing on the most critical aspects in the main text and moving only secondary details to the appendix.

**Soundness:** The evalua

Taxonomy

Topics: Machine Learning in Healthcare · Probability and Statistical Research · Data-Driven Disease Surveillance