The threat of analytic flexibility in using large language models to simulate human data
Jamie Cummins

TL;DR
This paper investigates how the analytic choices made when generating synthetic data with large language models can substantially change how closely those data match human data, a flexibility that poses a critical threat to research validity.
Contribution
The paper systematically demonstrates how configuration choices affect the quality of silicon samples, and it calls for greater awareness of this threat and for strategies to mitigate it.
Findings
Configurations vary widely in how well they recover participant rankings and response distributions.
Depending on the configuration chosen, the correlation between human and silicon data ranges from r = .23 to r = .84.
Different defensible configurations can lead to contrasting research conclusions.
Abstract
Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this…
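The three Study 1 criteria named in the abstract (participant rankings, response distributions, and between-scale correlations) can be illustrated with a minimal sketch. The specific metrics here are assumptions chosen for illustration, not necessarily those used in the paper: Spearman correlation for ranking recovery, total variation distance for distributional match, and the absolute gap in Pearson correlations for between-scale structure. All function names and the toy responses are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def total_variation(x, y):
    """Total variation distance between two empirical response distributions."""
    cx, cy = Counter(x), Counter(y)
    support = set(cx) | set(cy)
    return 0.5 * sum(abs(cx[v] / len(x) - cy[v] / len(y)) for v in support)


def evaluate(human_a, human_b, silicon_a, silicon_b):
    """Score one silicon-sample configuration against human data on two scales."""
    return {
        # Does the configuration order participants as the human data do?
        "ranking": spearman(human_a, silicon_a),
        # Does it reproduce the spread of responses (0 = identical)?
        "distribution": total_variation(human_a, silicon_a),
        # Does it preserve the correlation between the two scales?
        "between_scale_gap": abs(
            pearson(human_a, human_b) - pearson(silicon_a, silicon_b)
        ),
    }


# Toy, made-up responses for five participants on two 5-point scales.
human_a, human_b = [1, 2, 3, 4, 5], [2, 1, 4, 3, 5]
silicon_a, silicon_b = [3, 3, 3, 4, 4], [3, 3, 4, 4, 4]
scores = evaluate(human_a, human_b, silicon_a, silicon_b)
```

Scoring every configuration this way gives one row of three numbers per configuration. Because the criteria measure different things, a configuration can rank participants well while compressing the response distribution, which is the kind of trade-off the abstract describes.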
