Synthetically generated text for supervised text analysis

Andrew Halterman

arXiv:2303.16028 · cs.CL · June 18, 2025 · 5 cites

TL;DR

This paper explores the use of large language models to generate synthetic text for supervised political text analysis, addressing data scarcity, privacy, and copyright issues.

Contribution

It introduces a framework for controlled synthetic text generation, discusses ethical considerations, and demonstrates practical applications in political science research.

Findings

01 Synthetic tweets effectively describe real-world events.

02 Synthetic data improves training for event detection systems.

03 Multilingual synthetic corpora aid in classifying political language.

Abstract

Supervised text models are a valuable tool for political scientists but present several obstacles to their use, including the expense of hand-labeling documents, the difficulty of retrieving rare relevant documents for annotation, and copyright and privacy concerns involved in sharing annotated documents. This article proposes a partial solution to these three issues, in the form of controlled generation of synthetic text with large language models. I provide a conceptual overview of text generation, guidance on when researchers should prefer different techniques for generating synthetic text, a discussion of ethics, and a simple technique for improving the quality of synthetic text. I demonstrate the usefulness of synthetic text with three applications: generating synthetic tweets describing the fighting in Ukraine, synthetic news articles describing specified political events for…
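
The idea of controlled generation mentioned in the abstract can be illustrated with a minimal sketch: if the label is written into the prompt, every generated document arrives already labeled. The template, the genre and event lists, and the names `LabeledPrompt` and `build_prompts` below are hypothetical and are not taken from the paper or its repository. The step that sends each prompt to a language model is left out on purpose.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class LabeledPrompt:
    """A generation prompt paired with the label it is meant to produce."""
    prompt: str
    label: str


# Hypothetical template; a real study would tune the wording per task.
TEMPLATE = "Write a short {genre} describing the following event: {event}."


def build_prompts(genres, events):
    """Cross every genre with every event type.

    The event type doubles as the supervised label, so no hand-labeling
    is needed for text generated from these prompts.
    """
    return [
        LabeledPrompt(prompt=TEMPLATE.format(genre=g, event=e), label=e)
        for g, e in product(genres, events)
    ]


# Illustrative inputs only (assumed, not from the paper).
prompts = build_prompts(["tweet", "news article"], ["protest", "armed clash"])

for p in prompts:
    print(p.label, "|", p.prompt)
```

Each prompt would then be passed to whichever language model the researcher chooses, and the returned text stored alongside `label` as a training example.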

Code & Models

Repositories

ahalterman/ngec

Taxonomy

Topics: Hate Speech and Cyberbullying Detection