Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels
Nicholas Pangakis, Samuel Wolken

TL;DR
This paper evaluates the effectiveness of using large language model-generated labels as a substitute for human annotations in supervised text classification within computational social science, demonstrating comparable performance and efficiency.
Contribution
It introduces a workflow for using LLM-generated labels in supervised classification and empirically compares their performance to human labels across multiple CSS datasets.
Findings
LLM-generated labels yield comparable classification performance to human labels.
Using LLM labels is fast, cost-effective, and suitable for large-scale annotation.
The approach reduces reliance on human annotators without sacrificing accuracy.
Abstract
Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsNatural Language Processing Techniques · Text and Document Classification Technologies
MethodsAttention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer
