Towards Realistic Synthetic User-Generated Content: A Scaffolding Approach to Generating Online Discussions
Krisztian Balog, John Palowitch, Barbara Ikica, Filip Radlinski, Hamidreza Alvari, Mehdi Manshadi

TL;DR
This paper presents a novel scaffolding approach to generate realistic large-scale synthetic online discussion data using LLMs, addressing limitations of straightforward generation and enhancing data realism for machine learning applications.
Contribution
It introduces a multi-step generation framework with discussion scaffolds, improving the realism and control of synthetic social media content creation.
Findings
Framework successfully generates realistic discussion threads
Evaluation measures effectively compare synthetic data quality
Adaptable to different social media platforms
Abstract
The emergence of synthetic data represents a pivotal shift in modern machine learning, offering a solution to satisfy the need for large volumes of data in domains where real data is scarce, highly private, or difficult to obtain. We investigate the feasibility of creating realistic, large-scale synthetic datasets of user-generated content, noting that such content is increasingly prevalent and a source of frequently sought information. Large language models (LLMs) offer a starting point for generating synthetic social media discussion threads, due to their ability to produce diverse responses that typify online interactions. However, as we demonstrate, straightforward application of LLMs yields limited success in capturing the complex structure of online discussions, and standard prompting mechanisms lack sufficient control. We therefore propose a multi-step generation process,…
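As a rough illustration of why a multi-step process can give more control than a single prompt, the sketch below separates a thread's structure from its text. The abstract is truncated here and does not describe the paper's actual scaffold format, so everything in this example is an assumption made for illustration: the `Post` fields, the `build_scaffold` and `fill_scaffold` helpers, and the `stub_generate` function, which stands in for an LLM call.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Post:
    """One node in a hypothetical discussion scaffold."""
    post_id: int
    parent_id: Optional[int]  # None for the thread's root post
    author: str
    text: str = ""  # left empty until the second step


def build_scaffold(num_posts: int, authors: List[str], seed: int = 0) -> List[Post]:
    """Step 1 (assumed): decide who replies to whom, before any text exists."""
    rng = random.Random(seed)
    posts = [Post(0, None, rng.choice(authors))]
    for post_id in range(1, num_posts):
        parent_id = rng.randrange(post_id)  # reply to any earlier post
        posts.append(Post(post_id, parent_id, rng.choice(authors)))
    return posts


def fill_scaffold(
    posts: List[Post],
    generate: Callable[[Post, Optional[Post]], str],
) -> List[Post]:
    """Step 2 (assumed): generate text for each post, conditioned on its parent."""
    by_id = {post.post_id: post for post in posts}
    for post in posts:  # parents always precede their replies
        parent = by_id[post.parent_id] if post.parent_id is not None else None
        post.text = generate(post, parent)
    return posts


def stub_generate(post: Post, parent: Optional[Post]) -> str:
    """Placeholder for an LLM call; returns a fixed template."""
    if parent is None:
        return f"{post.author} starts the thread."
    return f"{post.author} replies to {parent.author}."


thread = fill_scaffold(build_scaffold(5, ["ana", "ben", "chi"]), stub_generate)
for post in thread:
    print(post.post_id, post.parent_id, post.text)
```

Because the reply structure is fixed before any text is produced, properties such as thread size or branching could be controlled directly in the first step rather than requested from the model in a prompt.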
Taxonomy
Topics: Wikis in Education and Collaboration · Innovative Teaching and Learning Methods · Advanced Text Analysis Techniques
