Twin-2K-500: A dataset for building digital twins of over 2,000 people based on their answers to over 500 questions
Olivier Toubia, George Z. Gui, Tianyi Peng, Daniel J. Merlau, Ang Li, Haozhe Chen

TL;DR
This paper introduces a large, publicly available dataset of over 2,000 individuals with extensive behavioral and demographic data, facilitating the development of digital twins and advancing research in AI and social sciences.
Contribution
The creation and release of a comprehensive, multi-wave dataset with 500 questions covering diverse human attributes, enabling new research and benchmarking in digital twin and behavioral modeling.
Findings
High data quality and test-retest reliability.
Potential for accurate individual and aggregate behavior prediction.
Supports broad social science research and AI applications.
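As a rough illustration of the test-retest reliability finding above, one common way to quantify it is the Pearson correlation between participants' answers to the same item at two different times. The sketch below assumes that approach. It is not the authors' procedure, and the function name, variable names, and response values are all hypothetical and made up for the example.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant responses")
    return cov / sqrt(var_x * var_y)


# Hypothetical 1-5 Likert answers from six participants to the same item,
# asked once in an early wave and again in a later wave (made-up values,
# not taken from the dataset).
first_wave = [4, 2, 5, 3, 1, 4]
retest_wave = [4, 3, 5, 3, 2, 4]

reliability = pearson(first_wave, retest_wave)
print(f"test-retest r = {reliability:.2f}")
```

A value near 1 would mean participants answered the repeated item consistently; how the paper itself measures reliability should be checked against the full text.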
Abstract
LLM-based digital twin simulation, where large language models are used to emulate individual human behavior, holds great promise for research in AI, social science, and digital experimentation. However, progress in this area has been hindered by the scarcity of real, individual-level datasets that are both large and publicly available. This lack of high-quality ground truth limits both the development and validation of digital twin methodologies. To address this gap, we introduce a large-scale, public dataset designed to capture a rich and holistic view of individual human behavior. We survey a representative sample of participants (average 2.42 hours per person) in the US across four waves with 500 questions in total, covering a comprehensive battery of demographic, psychological, economic, personality, and cognitive measures, as well as replications of behavioral…
