Knowledge Distillation in Automated Annotation: Supervised Text Classification with LLM-Generated Training Labels

Nicholas Pangakis; Samuel Wolken

arXiv:2406.17633·cs.CL·June 26, 2024·1 cites

Open Access

TL;DR

This paper evaluates whether labels generated by large language models (LLMs) can substitute for human annotations when training supervised text classifiers in computational social science (CSS). It reports that classifiers trained on LLM-generated labels perform comparably to those trained on human labels, while the labels are faster and cheaper to produce.

Contribution

The paper introduces a recommended workflow for using LLM-generated labels in supervised classification and empirically compares classifiers trained on those labels with classifiers trained on human labels across 14 classification tasks drawn from recent CSS articles.

Findings

01. LLM-generated labels yield comparable classification performance to human labels.

02. Using LLM labels is fast, cost-effective, and suitable for large-scale annotation.

03. The approach reduces reliance on human annotators without sacrificing accuracy.

Abstract

Computational social science (CSS) practitioners often rely on human-labeled data to fine-tune supervised text classifiers. We assess the potential for researchers to augment or replace human-generated training data with surrogate training labels from generative large language models (LLMs). We introduce a recommended workflow and test this LLM application by replicating 14 classification tasks and measuring performance. We employ a novel corpus of English-language text classification data sets from recent CSS articles in high-impact journals. Because these data sets are stored in password-protected archives, our analyses are less prone to issues of contamination. For each task, we compare supervised classifiers fine-tuned using GPT-4 labels against classifiers fine-tuned with human annotations and against labels from GPT-4 and Mistral-7B with few-shot in-context learning. Our findings…
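The comparison described in the abstract (the same supervised classifier trained once on human labels and once on LLM-generated labels, then scored against held-out human annotations) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: a tiny naive Bayes model stands in for the fine-tuned classifier, binary F1 is an assumed performance metric, and every function name and the toy texts and labels are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(texts, labels):
    """Fit a tiny multinomial naive Bayes model.

    Assumption: this is only a stand-in for the fine-tuned supervised
    classifier mentioned in the abstract.
    """
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"word_counts": word_counts, "class_counts": class_counts, "vocab": vocab}


def predict_nb(model, text):
    """Return the most probable class under add-one smoothing."""
    tokens = text.lower().split()
    total_docs = sum(model["class_counts"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["class_counts"].items():
        counts = model["word_counts"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def f1_binary(gold, pred, positive=1):
    """Binary F1 for the positive class; 0.0 when there are no true positives."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def compare_label_sources(train_texts, human_labels, llm_labels, test_texts, test_gold):
    """Train one model per label source and score both on human-labeled test data."""
    results = {}
    for name, labels in (("human", human_labels), ("llm", llm_labels)):
        model = train_nb(train_texts, labels)
        preds = [predict_nb(model, t) for t in test_texts]
        results[name] = f1_binary(test_gold, preds)
    return results


# Toy, made-up data purely for illustration (1 = about taxes, 0 = other).
train_texts = ["taxes are too high", "cut taxes now",
               "great game last night", "the team won again"]
human_labels = [1, 1, 0, 0]
llm_labels = [1, 1, 0, 1]  # hypothetical surrogate labels with one disagreement
test_texts = ["lower taxes please", "what a game"]
test_gold = [1, 0]

print(compare_label_sources(train_texts, human_labels, llm_labels, test_texts, test_gold))
```

Scoring both models against the same human-labeled held-out set keeps the comparison about the training labels only, since the model and the test data are held fixed.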

Taxonomy

Topics: Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Attention Is All You Need · Softmax · Layer Normalization · Absolute Position Encodings · Byte Pair Encoding · Label Smoothing · Position-Wise Feed-Forward Layer · Dropout · Adam · Linear Layer