Synthetic social data: trials and tribulations
Guido Ivetta, Laura Moradbakhti, Rafael A. Calvo

TL;DR
This paper evaluates the reliability of social data generated by Large Language Models compared to actual human survey data across four countries, highlighting limitations of synthetic data for social research.
Contribution
It provides a comparative analysis of LLM-generated social data versus real survey data, revealing the biases and limitations of AI-generated social insights.
Findings
Synthetic data is often less reliable than even a small, skewed sample of real respondents
Algorithmic biases in LLMs can overshadow real-world biases
Empirical human data remains crucial for social research
Abstract
Large Language Models are being used in conversational agents that simulate human conversations and generate social studies data. While concerns about the models' biases have been raised and discussed in the literature, much about the data generated is still unknown. In this study we explore the statistical representation of social values across four countries (UK, Argentina, USA and China) for six LLMs, with equal representation for open and closed weights. By comparing machine-generated outputs with actual human survey data, we assess whether algorithmic biases in LLMs outweigh the biases inherent in real-world sampling, including demographic and response biases. Our findings suggest that, despite the logistical and financial constraints of human surveys, even a small, skewed sample of real respondents may provide more reliable insights than synthetic data produced by LLMs. These…
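The abstract describes comparing LLM-generated answers with human survey answers, but this excerpt does not say which statistic the authors use. As a rough illustration only, the sketch below assumes total variation distance between two answer distributions on a single survey item. The answer scale, the sample data, and the function names are all hypothetical and are not taken from the paper.

```python
from collections import Counter


def distribution(responses, options):
    """Return the share of responses falling on each answer option."""
    if not responses:
        raise ValueError("responses must not be empty")
    counts = Counter(responses)
    total = len(responses)
    return [counts[option] / total for option in options]


def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical 4-point answer scale and made-up responses (not data from the paper).
options = [1, 2, 3, 4]
human_answers = [1, 1, 2, 2, 2, 3, 4, 4]
synthetic_answers = [2, 2, 2, 2, 2, 2, 3, 3]

human_dist = distribution(human_answers, options)
synthetic_dist = distribution(synthetic_answers, options)
tvd = total_variation(human_dist, synthetic_dist)
print(f"Total variation distance: {tvd:.3f}")
```

A distance near 0 would mean the synthetic answers track the human ones on that item; a value near 1 would mean the two groups almost never choose the same options. Other divergence measures could be swapped in without changing the overall shape of the comparison.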
Taxonomy
Topics: Computational and Text Analysis Methods · Mental Health via Writing · Human Mobility and Location-Based Analysis
