From Measurement Instruments to Data: Leveraging Theory-Driven Synthetic Training Data for Classifying Social Constructs
Lukas Birkenmaier, Matthias Roth, Indira Sen

TL;DR
This paper investigates whether theory-driven synthetic training data, derived from established measurement instruments, can improve the classification of social constructs in text. Its main benefit is a reduced need for labeled data in political topic classification.
Contribution
It introduces a method for generating synthetic data from social science measurement instruments, such as survey scales or annotation codebooks, and evaluates how much this data helps when fine-tuning text classification models.
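To make the idea of instrument-based generation concrete, here is a minimal sketch of how instrument items could be turned into generation prompts. It is not the authors' pipeline: the `InstrumentItem` class, the prompt template, and the example items are all hypothetical and invented for illustration. The call to a language model is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstrumentItem:
    """One entry of a measurement instrument (survey scale item or codebook category)."""
    construct: str  # e.g. "sexism" or "political topic"
    dimension: str  # sub-dimension or category the entry belongs to
    text: str       # item wording or category definition


def build_prompt(item: InstrumentItem, n_examples: int = 5) -> str:
    """Turn one instrument entry into a generation prompt (hypothetical template)."""
    return (
        f"Write {n_examples} short social media posts that express the "
        f"following aspect of {item.construct} ({item.dimension}): "
        f'"{item.text}"'
    )


def build_prompts(items, n_examples=5):
    """Return (label, prompt) pairs, using each entry's dimension as the label."""
    return [(item.dimension, build_prompt(item, n_examples)) for item in items]


# Placeholder entries, invented for illustration; not taken from any real codebook.
items = [
    InstrumentItem("political topic", "economy",
                   "Statements about taxes, jobs, or public spending."),
    InstrumentItem("political topic", "environment",
                   "Statements about climate or conservation policy."),
]
prompts = build_prompts(items, n_examples=3)
```

Each generated text would inherit the label of the entry that produced its prompt, so the synthetic examples can be added to a labeled training set before fine-tuning.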
Findings
Synthetic data reduces labeled data requirements for political topic classification.
Theory-driven synthetic data outperforms non-conceptual data generation.
Results vary by construct: the sexism study was less promising than the political topics study.
Abstract
Computational text classification is a challenging task, especially for multi-dimensional social constructs. Recently, there has been increasing discussion that synthetic training data could enhance classification by offering examples of how these constructs are represented in texts. In this paper, we systematically examine the potential of theory-driven synthetic training data for improving the measurement of social constructs. In particular, we explore how researchers can transfer established knowledge from measurement instruments in the social sciences, such as survey scales or annotation codebooks, into theory-driven generation of synthetic data. Using two studies on measuring sexism and political topics, we assess the added value of synthetic training data for fine-tuning text classification models. Although the results of the sexism study were less promising, our findings…
Taxonomy
Topics: Evaluation and Performance Assessment
