Hypothesis Testing for Quantifying LLM-Human Misalignment in Multiple Choice Settings
Harbin Hong, Sebastian Caldas, Liu Leqi

TL;DR
This paper introduces a hypothesis testing framework to evaluate how well large language models replicate human behaviors in multiple-choice survey settings, revealing significant misalignments across diverse sub-populations.
Contribution
The work provides a novel quantitative method for assessing LLM-human alignment in social science research, addressing a gap in evaluating model suitability for simulating human opinions.
Findings
The tested LLM poorly simulates the tested sub-populations (e.g., across different races, ages, and incomes) on contentious questions.
The framework effectively detects misalignment in model behavior.
Results suggest caution in using LLMs for social science simulations.
Abstract
As Large Language Models (LLMs) increasingly appear in social science research (e.g., economics and marketing), it becomes crucial to assess how well these models replicate human behavior. In this work, using hypothesis testing, we present a quantitative framework to assess the misalignment between LLM-simulated and actual human behaviors in multiple-choice survey settings. This framework allows us to determine in a principled way whether a specific language model can effectively simulate human opinions, decision-making, and general behaviors represented through multiple-choice options. We applied this framework to a popular language model for simulating people's opinions in various public surveys and found that this model is ill-suited for simulating the tested sub-populations (e.g., across different races, ages, and incomes) for contentious questions. This raises questions about the…
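To make the idea of testing LLM-human misalignment on a multiple-choice question concrete, here is a minimal sketch. It is not the paper's procedure: the abstract does not say which test statistic is used, so the sketch assumes a Pearson chi-square statistic for homogeneity with a permutation p-value. The function names and the answer counts are hypothetical and chosen only for illustration.

```python
import random


def chi2_homogeneity(human_counts, llm_counts):
    """Pearson chi-square statistic for a 2 x K table of answer counts.

    Assumed statistic for illustration; not taken from the paper.
    """
    n_h, n_l = sum(human_counts), sum(llm_counts)
    if n_h == 0 or n_l == 0:
        raise ValueError("both groups need at least one response")
    total = n_h + n_l
    stat = 0.0
    for h, l in zip(human_counts, llm_counts):
        col = h + l
        if col == 0:
            continue  # option chosen by nobody contributes nothing
        for obs, n in ((h, n_h), (l, n_l)):
            expected = n * col / total  # expected count under the null
            stat += (obs - expected) ** 2 / expected
    return stat


def permutation_p_value(human_counts, llm_counts, n_perm=2000, seed=0):
    """Permutation p-value for H0: both groups share one answer distribution."""
    rng = random.Random(seed)
    observed = chi2_homogeneity(human_counts, llm_counts)
    k = len(human_counts)
    n_h = sum(human_counts)
    # Pool individual responses, each recorded as the index of the chosen option.
    pooled = [
        option
        for option, (h, l) in enumerate(zip(human_counts, llm_counts))
        for _ in range(h + l)
    ]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        perm_h, perm_l = [0] * k, [0] * k
        for idx, option in enumerate(pooled):
            if idx < n_h:
                perm_h[option] += 1
            else:
                perm_l[option] += 1
        if chi2_homogeneity(perm_h, perm_l) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical counts for one three-option survey question.
human = [80, 15, 5]
llm = [5, 15, 80]
p = permutation_p_value(human, llm)
print("reject H0 at 0.05" if p < 0.05 else "fail to reject H0")
```

A small p-value here is evidence that the simulated answers do not follow the human answer distribution for that question and sub-population. A large p-value only means the test failed to detect a difference, which is weaker than showing alignment.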
Taxonomy
Topics: Private Equity and Venture Capital · Consumer Market Behavior and Pricing · Diverse Scientific and Economic Studies
