Synthetic social data: trials and tribulations

Guido Ivetta; Laura Moradbakhti; Rafael A. Calvo

arXiv:2510.19952 · cs.CY · October 24, 2025

Open Access

TL;DR

This paper compares social data generated by six Large Language Models with actual human survey data across four countries (UK, Argentina, USA and China), highlighting the limitations of synthetic data for social research.

Contribution

It provides a comparative analysis of LLM-generated social data versus real survey data, revealing the biases and limitations of AI-generated social insights.

Findings

01 Synthetic data is often less reliable than even a small, skewed sample of real respondents

02 Algorithmic biases in LLMs can outweigh the biases inherent in real-world sampling

03 Empirical human data remains crucial for social research

Abstract

Large Language Models are being used in conversational agents that simulate human conversations and generate social studies data. While concerns about the models' biases have been raised and discussed in the literature, much about the data generated is still unknown. In this study we explore the statistical representation of social values across four countries (UK, Argentina, USA and China) for six LLMs, with equal representation for open and closed weights. By comparing machine-generated outputs with actual human survey data, we assess whether algorithmic biases in LLMs outweigh the biases inherent in real-world sampling, including demographic and response biases. Our findings suggest that, despite the logistical and financial constraints of human surveys, even a small, skewed sample of real respondents may provide more reliable insights than synthetic data produced by LLMs. These…
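
The comparison described in the abstract can be pictured with a small sketch. The abstract does not say which statistic the authors use, so the total variation distance below is an assumption chosen for simplicity, and every name and number in the code is hypothetical toy data, not a result from the paper. The idea is only to show how one might measure how far an LLM-generated answer distribution and a small human sample each sit from a reference survey distribution.

```python
from collections import Counter


def answer_distribution(responses, options):
    """Return the share of responses falling on each answer option."""
    counts = Counter(responses)
    total = len(responses)
    return {opt: counts[opt] / total for opt in options}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions
    defined over the same answer options (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)


# Hypothetical answer options for a single survey item.
OPTIONS = ["agree", "neutral", "disagree"]

# Hypothetical reference distribution (stands in for a large human survey).
reference = {"agree": 0.5, "neutral": 0.3, "disagree": 0.2}

# Hypothetical small human sample (10 respondents), slightly skewed.
small_human_sample = ["agree"] * 6 + ["neutral"] * 2 + ["disagree"] * 2

# Hypothetical LLM outputs for the same item (10 generations).
llm_outputs = ["agree"] * 9 + ["neutral"] * 1

human_gap = total_variation_distance(
    answer_distribution(small_human_sample, OPTIONS), reference
)
llm_gap = total_variation_distance(
    answer_distribution(llm_outputs, OPTIONS), reference
)

print(f"small human sample vs reference: {human_gap:.2f}")
print(f"LLM outputs vs reference:        {llm_gap:.2f}")
```

In this toy setup the smaller gap belongs to the human sample, but that follows purely from the made-up numbers; a real analysis would need the actual survey items, the models' outputs, and whatever metric the paper reports.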

Taxonomy

Topics: Computational and Text Analysis Methods · Mental Health via Writing · Human Mobility and Location-Based Analysis