Large Language Models Are Bad Dice Players: LLMs Struggle to Generate Random Numbers from Statistical Distributions
Minda Zhao, Yilun Du, Mengyu Wang

TL;DR
This study shows that large language models struggle to sample accurately from specified probability distributions, especially across independent requests, which undermines their reliability in stochastic applications.
Contribution
The paper provides the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions under two protocols.
Findings
Batch generation reaches only a 7% median pass rate
Independent requests collapse almost entirely, passing nearly none of the distributions
Sampling fidelity worsens with distribution complexity and sample size
Abstract
As large language models (LLMs) transition from chat interfaces to integral components of stochastic pipelines and systems approaching general intelligence, the ability to faithfully sample from specified probability distributions has become a functional requirement rather than a theoretical curiosity. We present the first large-scale, statistically powered audit of native probabilistic sampling in frontier LLMs, benchmarking 11 models across 15 distributions. To disentangle failure modes, we employ a dual-protocol design: Batch Generation, where a model produces samples within one response, and Independent Requests, comprising stateless calls. We observe a sharp protocol asymmetry: batch generation achieves only modest statistical validity, with a 7% median pass rate, while independent requests collapse almost entirely, with 10 of 11 models passing none of the…
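To make the notion of a "pass rate" concrete, here is a minimal sketch of how a set of generated samples could be checked against a target distribution. The abstract does not say which statistical test or significance level the authors use, so this sketch assumes a one-sample Kolmogorov-Smirnov test at alpha = 0.05 with the standard asymptotic p-value approximation. All names (`ks_statistic`, `ks_pvalue`, `passes`, `pass_rate`, `uniform_cdf`) are hypothetical, and the "collapsed" run is a made-up stand-in for a model that keeps returning the same number.

```python
import math
import random


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of `samples` and the target `cdf`."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # Compare the target CDF with the empirical CDF just after and just before x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def ks_pvalue(d, n, terms=100):
    """Asymptotic Kolmogorov p-value with the usual small-sample correction."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.05:
        return 1.0  # the series tends to 1 here; avoid slow convergence
    total = sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, terms + 1)
    )
    return min(1.0, max(0.0, 2.0 * total))


def passes(samples, cdf, alpha=0.05):
    """True if the test does NOT reject the target distribution (assumed criterion)."""
    return ks_pvalue(ks_statistic(samples, cdf), len(samples)) >= alpha


def pass_rate(runs, cdf, alpha=0.05):
    """Fraction of sample sets in `runs` that pass the test."""
    return sum(passes(r, cdf, alpha) for r in runs) / len(runs)


def uniform_cdf(x):
    """CDF of the Uniform(0, 1) target used in this illustration."""
    return min(1.0, max(0.0, x))


if __name__ == "__main__":
    rng = random.Random(0)
    reference = [rng.random() for _ in range(200)]  # a real PRNG, for comparison
    collapsed = [0.7] * 200                         # hypothetical degenerate output
    print("reference passes:", passes(reference, uniform_cdf))
    print("collapsed passes:", passes(collapsed, uniform_cdf))
    print("pass rate:", pass_rate([reference, collapsed], uniform_cdf))
```

Under this assumed criterion, the same check applies to both protocols: in batch generation the sample set comes from one response, while for independent requests it is assembled from many stateless calls.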
