This human study did not involve human subjects: Validating LLM simulations as behavioral evidence
Jessica Hullman, David Broska, Huaman Sun, and Aaron Shaw

TL;DR
This paper evaluates the validity of using large language models as synthetic participants in social science experiments, contrasting heuristic and statistical calibration methods for causal inference, and discussing their assumptions and limitations.
Contribution
It clarifies when heuristic versus statistical calibration approaches are appropriate for LLM-based social science research, highlighting their assumptions and validity conditions.
Findings
Heuristic methods lack formal statistical guarantees for confirmatory research.
Statistical calibration can provide valid and more precise causal estimates under explicit assumptions.
Both approaches depend on how well LLMs approximate the relevant populations.
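To make the idea of statistical calibration concrete, here is a minimal sketch of one possible estimator. It is an assumption chosen for illustration, not necessarily the method the paper analyzes: the summary above does not name a specific estimator. The sketch averages the LLM responses over a large simulated sample, then corrects that average using the mean human-minus-LLM gap observed on a small paired sample. The names calibrated_mean and calibrated_effect and all the numbers are hypothetical.

```python
from statistics import fmean


def calibrated_mean(llm_all, llm_paired, human_paired):
    """Bias-corrected mean outcome for one experimental arm.

    llm_all      : LLM-simulated outcomes for a large synthetic sample
    llm_paired   : LLM-simulated outcomes for the items that also have human data
    human_paired : observed human outcomes for those same items

    Illustrative assumption: the paired items are a random sample from the
    same population as the large synthetic sample.
    """
    if len(llm_paired) != len(human_paired):
        raise ValueError("paired LLM and human samples must have equal length")
    # Average gap between human and simulated responses on the paired items.
    correction = fmean([h - s for h, s in zip(human_paired, llm_paired)])
    # Shift the large-sample LLM mean by that estimated gap.
    return fmean(llm_all) + correction


def calibrated_effect(treated, control):
    """Difference in calibrated means between two arms.

    Each arm is a tuple (llm_all, llm_paired, human_paired).
    """
    return calibrated_mean(*treated) - calibrated_mean(*control)


# Hypothetical toy data, invented for this example only.
treated = ([0.9, 0.8, 0.7, 0.8], [0.9, 0.7], [0.6, 0.5])
control = ([0.5, 0.5, 0.4, 0.6], [0.5, 0.4], [0.4, 0.4])

if __name__ == "__main__":
    print(calibrated_effect(treated, control))
```

In this toy data the uncorrected LLM means overstate the treated arm more than the control arm, so the correction shrinks the estimated effect. A real analysis would also need a variance estimate and confidence interval, which this sketch omits.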
Abstract
A growing literature uses large language models (LLMs) as synthetic participants to generate cost-effective and nearly instantaneous responses in social science experiments. However, there is limited guidance on when such simulations support valid inference about human behavior. We contrast two strategies for obtaining valid estimates of causal effects and clarify the assumptions under which each is suitable for exploratory versus confirmatory research. Heuristic approaches seek to establish that simulated and observed human behavior are interchangeable through prompt engineering, model fine-tuning, and other repair strategies designed to reduce LLM-induced inaccuracies. While useful for many exploratory tasks, heuristic approaches lack the formal statistical guarantees typically required for confirmatory research. In contrast, statistical calibration combines auxiliary human data with…
Taxonomy
Topics: Computational and Text Analysis Methods · Language and cultural evolution · Topic Modeling
