Synthetically generated text for supervised text analysis
Andrew Halterman

TL;DR
This paper explores the use of large language models to generate synthetic text for supervised political text analysis, addressing data scarcity, privacy, and copyright issues.
Contribution
It introduces a framework for controlled synthetic text generation, discusses ethical considerations, and demonstrates practical applications in political science research.
Findings
Synthetic tweets effectively describe real-world events, such as the fighting in Ukraine.
Synthetic data improves training for event detection systems.
Multilingual synthetic corpora aid in classifying political language.
Abstract
Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for…
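As a rough illustration of the idea of controlled generation described in the abstract, the sketch below builds prompts from a chosen label so that each generated document carries its label by construction, with no hand-annotation step. It is not the paper's implementation: the label set, the style list, the prompt wording, and the names `build_prompts`, `make_training_rows`, and `stub_generate` are all assumptions made for this example, and the generator is a stand-in that would be swapped for a real language-model call.

```python
from itertools import product

# Hypothetical label set and output styles, chosen only for illustration.
EVENT_LABELS = ["protest", "airstrike", "ceasefire"]
STYLES = ["tweet", "news article"]


def build_prompts(labels, styles):
    """Return one (prompt, label) pair per label/style combination."""
    return [
        (f"Write a {style} describing this event type: {label}.", label)
        for label, style in product(labels, styles)
    ]


def make_training_rows(prompt_label_pairs, generate):
    """Call `generate` on each prompt and attach the label used to build it."""
    return [
        {"text": generate(prompt), "label": label}
        for prompt, label in prompt_label_pairs
    ]


def stub_generate(prompt):
    # Placeholder for a real language-model call (an assumption of this sketch).
    return f"[synthetic text for: {prompt}]"


pairs = build_prompts(EVENT_LABELS, STYLES)
rows = make_training_rows(pairs, stub_generate)
```

Passing the generator in as a function keeps the labeling logic independent of any particular model or API, so the same rows can be produced from a local model or a hosted one.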
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
