Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies
Gati Aher, Rosa I. Arriaga, Adam Tauman Kalai

TL;DR
This paper introduces Turing Experiments to evaluate how well large language models can simulate human behavior across various psychological and economic studies, revealing both capabilities and distortions.
Contribution
The paper proposes a new testing framework, the Turing Experiment, for assessing how well language models replicate human behavior in research settings, highlighting both strengths and limitations.
Findings
Language models reproduce findings from classic experiments such as the Ultimatum Game and the Milgram Shock Experiment.
The Turing Experiments reveal a "hyper-accuracy distortion" in some language models, which could affect downstream applications.
The results demonstrate the utility of Turing Experiments for evaluating behavioral simulation.
Abstract
We introduce a new type of test, called a Turing Experiment (TE), for evaluating to what extent a given language model, such as GPT models, can simulate different aspects of human behavior. A TE can also reveal consistent distortions in a language model's simulation of a specific human behavior. Unlike the Turing Test, which involves simulating a single arbitrary individual, a TE requires simulating a representative sample of participants in human subject research. We carry out TEs that attempt to replicate well-established findings from prior studies. We design a methodology for simulating TEs and illustrate its use to compare how well different language models are able to reproduce classic economic, psycholinguistic, and social psychology experiments: Ultimatum Game, Garden Path Sentences, Milgram Shock Experiment, and Wisdom of Crowds. In the first three TEs, the existing findings…
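The core idea of simulating a sample of participants rather than a single individual can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt wording, the names `build_prompts`, `run_te`, and `stub_model`, and the acceptance threshold are all hypothetical, and the stub stands in for a real language model call. The scenario is an Ultimatum Game, in which a proposer offers a share of a sum and a responder accepts or rejects it.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical prompt template; the paper's actual prompts are not reproduced here.
TEMPLATE = (
    "{proposer} is given $10 and offers {responder} ${offer}. "
    "{responder} decides to"
)


def build_prompts(names: List[str], offer: int) -> List[str]:
    """Build one prompt per ordered pair of distinct participant names."""
    return [
        TEMPLATE.format(proposer=p, responder=r, offer=offer)
        for p in names
        for r in names
        if p != r
    ]


def run_te(prompts: List[str], complete: Callable[[str], str]) -> Dict[str, float]:
    """Query the model once per simulated participant and return response shares."""
    counts = Counter(complete(prompt) for prompt in prompts)
    total = sum(counts.values())
    return {answer: n / total for answer, n in counts.items()}


def stub_model(prompt: str) -> str:
    """Placeholder for a language model (assumption): accepts offers of $3 or more."""
    offer = int(prompt.rsplit("$", 1)[1].split(".")[0])
    return "accept" if offer >= 3 else "reject"


if __name__ == "__main__":
    names = ["Alice", "Bob", "Carol"]  # illustrative participant names
    for offer in (1, 5):
        print(offer, run_te(build_prompts(names, offer), stub_model))
```

In a real experiment, `stub_model` would be replaced by a call to the language model under test, and the aggregated response shares would be compared with the rates reported in the original human study.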
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Misinformation and Its Impacts
Methods: Multi-Head Attention · Attention Is All You Need · Discriminative Fine-Tuning · GPT · Test · Linear Layer · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing
