Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

Ahmad Dawar Hakimi; Lea Hirlimann; Isabelle Augenstein; Hinrich Schütze

arXiv:2604.13899·cs.CL·April 23, 2026

TL;DR

This study compares human and large language model (LLM) annotation in active learning for detecting anti-immigrant hostility in a dataset of 277,902 German political TikTok comments. Classifiers trained on LLM labels reach macro-F1 comparable to classifiers trained on human labels at lower cost, but with a different error pattern.

Contribution

The paper compares seven human and LLM annotation strategies across four encoders in active learning for hostility detection, and reports differences in both cost-effectiveness and error profile.
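For readers less familiar with the setup, the selection step that separates active learning from random sampling can be sketched as follows. This is a minimal illustration only: it assumes uncertainty sampling on a binary classifier's predicted probabilities, which the summary above does not confirm as the paper's acquisition function, and the function names and pool probabilities are hypothetical.

```python
import random


def uncertainty_sample(probs, k):
    """Pick the k pool items whose positive-class probability is closest to 0.5.

    Assumption: uncertainty sampling stands in for the acquisition function;
    the paper's actual strategy may differ.
    """
    ranked = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return ranked[:k]


def random_sample(n_pool, k, seed=0):
    """Baseline: pick k distinct pool indices uniformly at random."""
    rng = random.Random(seed)
    return rng.sample(range(n_pool), k)


# Hypothetical classifier probabilities for six unlabelled comments.
pool_probs = [0.02, 0.48, 0.91, 0.55, 0.10, 0.70]

al_batch = uncertainty_sample(pool_probs, 2)    # items the model is least sure about
rand_batch = random_sample(len(pool_probs), 2)  # items chosen without consulting the model
```

In either case the selected batch would then be sent to an annotator (human or LLM), added to the training set, and the classifier retrained before the next round.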

Findings

01

A classifier trained on 25,974 GPT-5.2 labels ($43) reaches macro-F1 comparable to one trained on 3,800 human annotations ($316).

02

Active learning offers little advantage over random sampling in the pre-enriched pool, and yields lower F1 than full LLM annotation at the same cost.

03

LLM-trained classifiers tend to over-predict positive cases, especially in ambiguous discussions.
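To make concrete how matching aggregate F1 can coexist with different error profiles, here is a small sketch that computes macro-F1 and the false-positive rate for two sets of predictions. The labels and predictions are invented toy data, not results from the paper, and the helper names are hypothetical.

```python
def _f1(tp, fp, fn):
    """F1 for one class from its confusion counts (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def macro_f1(y_true, y_pred):
    """Unweighted mean of the positive-class and negative-class F1."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    return (_f1(tp, fp, fn) + _f1(tn, fn, fp)) / 2


def false_positive_rate(y_true, y_pred):
    """Share of true negatives that were predicted positive."""
    _, fp, _, tn = confusion(y_true, y_pred)
    return fp / (fp + tn) if (fp + tn) else 0.0


# Toy data (made up for illustration): 5 hostile and 5 non-hostile comments.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred_a = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]  # misses one hostile comment
pred_b = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]  # flags one benign comment as hostile

macro_a, macro_b = macro_f1(y_true, pred_a), macro_f1(y_true, pred_b)
fpr_a, fpr_b = false_positive_rate(y_true, pred_a), false_positive_rate(y_true, pred_b)
```

In this toy case both prediction sets have the same macro-F1, yet only the second produces false positives, which is why a per-class or per-error-type breakdown is worth reporting alongside the aggregate score.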

Abstract

Instruction-tuned LLMs can annotate thousands of instances from a short prompt at negligible cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be labelled at once? We investigate both questions on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labelled, 5,000 human-annotated), comparing seven annotation strategies across four encoders to detect anti-immigrant hostility. A classifier trained on 25,974 GPT-5.2 labels ($43) achieves comparable F1-Macro to one trained on 3,800 human annotations ($316). Active learning offers little advantage over random sampling in our pre-enriched pool and delivers lower F1 than full LLM annotation at the same cost. However, comparable aggregate F1 masks a systematic difference in error structure: LLM-trained…
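The cost figures quoted in the abstract can be turned into a per-label comparison with a few lines of arithmetic. The sketch below uses only the label counts and dollar amounts stated above; the variable names are illustrative, and the per-label values are derived here rather than reported by the authors.

```python
# Figures as stated in the abstract.
llm_labels, llm_cost_usd = 25_974, 43.0
human_labels, human_cost_usd = 3_800, 316.0

# Derived quantities (computed here, not reported in the abstract).
llm_per_label = llm_cost_usd / llm_labels        # about 0.0017 USD per label
human_per_label = human_cost_usd / human_labels  # about 0.083 USD per label
cost_ratio = human_per_label / llm_per_label     # roughly 50x

print(f"LLM:   ${llm_per_label:.4f} per label")
print(f"Human: ${human_per_label:.4f} per label")
print(f"Human labels cost about {cost_ratio:.0f}x more per label")
```

On these numbers a human annotation costs roughly fifty times as much as a GPT-5.2 label, which frames the comparison between the two classifiers that reach comparable macro-F1.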
