When simulations look right but causal effects go wrong: Large language models as behavioral simulators
Zonghan Li, Feng Ji

TL;DR
This study evaluates how well large language models simulate behavioral responses to climate interventions. The models reproduce attitudinal patterns reasonably well but often misestimate causal effects, especially for behavior-related outcomes.
Contribution
It highlights the divergence between descriptive accuracy and causal fidelity in LLM simulations of interventions, and urges caution when interpreting simulated intervention effects.
Findings
LLMs reasonably replicate observed attitudinal patterns.
Prompt refinements improve descriptive fit but not causal accuracy.
Errors vary across intervention types and behavioral outcomes.
Abstract
Behavioral simulation is increasingly used to anticipate responses to interventions. Large language models (LLMs) enable researchers to specify population characteristics and intervention context in natural language, but it remains unclear to what extent LLMs can use these inputs to infer intervention effects. We evaluated three LLMs on 11 climate-psychology interventions using a dataset of 59,508 participants from 62 countries, and replicated the main analysis in two additional datasets (12 and 27 countries). LLMs reproduced observed patterns in attitudinal outcomes (e.g., climate beliefs and policy support) reasonably well, and prompting refinements improved this descriptive fit. However, descriptive fit did not reliably translate into causal fidelity (i.e., accurate estimates of intervention effects), and these two dimensions of accuracy followed different error structures. This…
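The distinction between descriptive fit and causal fidelity can be made concrete with a toy calculation. The sketch below is illustrative only: the numbers are invented, and the two functions (`descriptive_error`, `effect`) are hypothetical simplifications, not the metrics or data used in the paper. It shows how simulated outcomes can sit close to the observed condition means while missing the treatment-control difference entirely.

```python
from statistics import mean

def descriptive_error(observed, simulated):
    """Mean absolute gap between observed and simulated condition means.

    Hypothetical stand-in for 'descriptive fit' (lower is better).
    """
    return mean(abs(mean(observed[c]) - mean(simulated[c])) for c in observed)

def effect(outcomes, treated="treatment", control="control"):
    """Difference in mean outcome between the treated and control groups.

    Hypothetical stand-in for an intervention effect estimate.
    """
    return mean(outcomes[treated]) - mean(outcomes[control])

# Invented outcome scores on a 0-100 scale (not data from the paper).
observed = {"control": [60, 62, 64], "treatment": [66, 68, 70]}
simulated = {"control": [62, 64, 66], "treatment": [62, 64, 66]}

observed_effect = effect(observed)      # 68 - 62
simulated_effect = effect(simulated)    # 64 - 64
level_gap = descriptive_error(observed, simulated)

print(f"descriptive error: {level_gap}")
print(f"observed effect:   {observed_effect}")
print(f"simulated effect:  {simulated_effect}")
```

In this toy case the simulated means are within a few points of the observed means in each condition, yet the simulated effect is zero while the observed effect is not. This is the kind of gap the abstract describes when it says descriptive fit did not reliably translate into causal fidelity.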
