Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions

Olivier Toubia; George Z. Gui; Tianyi Peng; Daniel J. Merlau; Ang Li; Haozhe Chen

arXiv:2505.17479·cs.CY·May 26, 2025

1 Repo 2 Datasets

TL;DR

This paper introduces a large, publicly available dataset of over 2,000 individuals with extensive behavioral and demographic data, facilitating the development of digital twins and advancing research in AI and social sciences.

Contribution

The creation and release of a comprehensive, multi-wave dataset with 500 questions covering diverse human attributes, enabling new research and benchmarking in digital twin and behavioral modeling.

Findings

01 High data quality and test-retest reliability.

02 Potential for accurate individual and aggregate behavior prediction.

03 Supports broad social science research and AI applications.

Abstract

LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of $N = 2,058$ participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral…

Code & Models

Repositories

tianyipeng-lab/digital-twin-simulation
Official

Datasets
