Leveraging GPT for the Generation of Multi-Platform Social Media Datasets for Research
Henry Tari, Danial Khan, Justus Rutten, Darian Othman, Rishabh Kaushal, Thales Bertaglia, and Adriana Iamnitchi

TL;DR
This paper investigates using large language models, specifically ChatGPT, to generate synthetic multi-platform social media datasets that could serve research needs while overcoming access restrictions.
Contribution
It demonstrates the feasibility of using GPT models to create high-quality, multi-platform social media datasets, addressing data accessibility challenges in research.
Findings
Synthetic data closely matches real data lexically and semantically
Using GPT models for dataset generation shows promising results
Further improvements needed for higher fidelity outputs
Abstract
Social media datasets are essential for research on disinformation, influence operations, social sensing, hate speech detection, cyberbullying, and other significant topics. However, access to these datasets is often restricted due to costs and platform regulations. As such, acquiring datasets that span multiple platforms, which are crucial for a comprehensive understanding of the digital ecosystem, is particularly challenging. This paper explores the potential of large language models to create lexically and semantically relevant social media datasets across multiple platforms, aiming to match the quality of real datasets. We employ ChatGPT to generate synthetic data from two real datasets, each consisting of posts from three different social media platforms. We assess the lexical and semantic properties of the synthetic data and compare them with those of the real data. Our empirical…
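To make the idea of a lexical comparison between real and synthetic posts concrete, here is a minimal sketch in Python. It is not the authors' evaluation pipeline: the metrics (Jaccard overlap of vocabularies and cosine similarity of term-frequency vectors), the tokenizer, and the example posts are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase, keep words plus hashtags and mentions.
    return re.findall(r"[#@]?\w+", text.lower())


def term_counts(posts):
    # Aggregate token frequencies over a collection of posts.
    counts = Counter()
    for post in posts:
        counts.update(tokenize(post))
    return counts


def jaccard(a, b):
    # Share of distinct tokens that appear in both collections.
    vocab_a, vocab_b = set(a), set(b)
    union = vocab_a | vocab_b
    return len(vocab_a & vocab_b) / len(union) if union else 0.0


def cosine(a, b):
    # Cosine similarity between the two term-frequency vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical posts, invented for this sketch (not taken from the paper).
real_posts = ["Vaccines are safe #health", "Check the facts @who"]
synthetic_posts = ["Vaccines are safe and tested #health"]

real_counts = term_counts(real_posts)
synthetic_counts = term_counts(synthetic_posts)

print("Jaccard vocabulary overlap:", jaccard(real_counts, synthetic_counts))
print("Term-frequency cosine:", cosine(real_counts, synthetic_counts))
```

A semantic comparison would typically swap the term-frequency vectors for sentence embeddings, which requires a model outside the standard library and is left out of this sketch.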
Taxonomy
Topics: Topic Modeling · Scientific Computing and Data Management
