Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations

Yong Cao; Haijiang Liu; Arnav Arora; Isabelle Augenstein; Paul Röttger; Daniel Hershcovich

arXiv:2502.07068·cs.CL·February 20, 2025

TL;DR

This paper introduces a fine-tuning approach for large language models to accurately simulate survey response distributions at the country level, aiming to reduce the need for costly survey data collection.

Contribution

It presents the first method for specializing LLMs to simulate survey response distributions, and reports that it substantially outperforms other methods and zero-shot classifiers, including on unseen questions and countries.

Findings

01. Fine-tuning improves response distribution accuracy
02. Method outperforms zero-shot classifiers
03. Models still struggle with unseen questions

Abstract

Large-scale surveys are essential tools for informing social science research and policy, but running surveys is costly and time-intensive. If we could accurately simulate group-level survey results, this would therefore be very valuable to social science research. Prior work has explored the use of large language models (LLMs) for simulating human behaviors, mostly through prompting. In this paper, we are the first to specialize LLMs for the task of simulating survey response distributions. As a testbed, we use country-level results from two global cultural surveys. We devise a fine-tuning method based on first-token probabilities to minimize divergence between predicted and actual response distributions for a given question. Then, we show that this method substantially outperforms other methods and zero-shot classifiers, even on unseen questions, countries, and a completely unseen…
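The fine-tuning objective described in the abstract can be illustrated with a small sketch. It assumes that each answer option maps to a single token, that the predicted distribution is a softmax over the first-token logits of those option tokens, and that the divergence is KL; the abstract does not name the divergence, so that choice is an assumption. The function names, logits, and survey shares below are hypothetical and are not taken from the paper or its repository.

```python
import math


def option_distribution(first_token_logits, option_tokens):
    """Softmax restricted to the first-token logits of the answer options.

    first_token_logits: dict mapping token string -> logit (hypothetical values).
    option_tokens: tokens that stand for the answer options, in order.
    """
    logits = [first_token_logits[t] for t in option_tokens]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); terms with zero target mass contribute nothing."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(target, predicted)
        if p > 0
    )


# Hypothetical first-token logits for one survey question with four options.
# The non-option token "the" is ignored when forming the distribution.
first_token_logits = {"1": 2.0, "2": 1.0, "3": 0.0, "4": -1.0, "the": 3.5}
option_tokens = ["1", "2", "3", "4"]

# Hypothetical observed country-level response shares for the same question.
survey_distribution = [0.4, 0.3, 0.2, 0.1]

predicted = option_distribution(first_token_logits, option_tokens)
loss = kl_divergence(survey_distribution, predicted)
print("predicted:", [round(p, 3) for p in predicted])
print("loss:", round(loss, 4))
```

In an actual fine-tuning run, a loss of this kind would be computed on model logits in a differentiable framework and backpropagated; the pure-Python version here only shows how the quantity is formed.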

Code & Models

Repositories

yongcaoplus/SimLLMCultureDist (PyTorch, official)

Videos

Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations (Underline)

Taxonomy

Topics: Survey Methodology and Nonresponse