Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection
Paloma Piot, David Otero, Patricia Martín-Rodilla, Javier Parapar

TL;DR
This paper investigates the reliability of Large Language Models in hate speech detection, revealing that while they diverge from humans at the instance level, they can reliably reflect relative model performance trends, serving as scalable proxies for evaluation.
Contribution
The study applies a subjectivity-aware framework, cross-Rater Reliability (xRR), to assess LLM reliability and demonstrates the potential of LLMs as proxy evaluators for ranking model performance.
Findings
LLMs diverge from humans at the instance level in hate speech detection.
LLMs can reliably reflect relative performance trends across models.
LLMs may serve as scalable proxies for subjective NLP task evaluation.
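The gap between instance-level disagreement and ranking-level agreement can be illustrated with a small sketch. This is not the paper's evaluation protocol: the toy labels, the model names, the use of accuracy as the metric, and the use of Kendall's tau (tau-a, no tie correction) as the ranking measure are all assumptions made for illustration.

```python
from itertools import combinations


def accuracy(predictions, labels):
    """Fraction of items where the prediction matches the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        product = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical toy data: 1 = hate speech, 0 = not hate speech.
human_labels = [1, 0, 1, 1, 0, 0, 1, 0]
llm_labels = [1, 0, 0, 1, 0, 1, 1, 0]  # differs from humans on two items

# Hypothetical predictions from three classifiers under evaluation.
model_predictions = {
    "model_a": [1, 0, 1, 1, 0, 0, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 0, 0, 1],
    "model_c": [0, 1, 1, 0, 0, 0, 0, 1],
}

# Instance-level agreement between the LLM annotator and humans.
instance_agreement = accuracy(llm_labels, human_labels)

# Score every model once against human labels and once against LLM labels.
scores_vs_human = [accuracy(p, human_labels) for p in model_predictions.values()]
scores_vs_llm = [accuracy(p, llm_labels) for p in model_predictions.values()]

# Ranking-level agreement between the two sets of scores.
ranking_agreement = kendall_tau(scores_vs_human, scores_vs_llm)

print(f"instance-level agreement: {instance_agreement:.2f}")
print(f"ranking agreement (Kendall tau): {ranking_agreement:.2f}")
```

In this toy setup the LLM labels disagree with the human labels on a quarter of the items, yet both label sets order the three models identically. Absolute scores shift, but the relative ordering is preserved, which is the sense in which an imperfect annotator can still act as a proxy evaluator.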
Abstract
Hate speech spreads widely online, harming individuals and communities, making automatic detection essential for large-scale moderation, yet detecting it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's κ, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under a fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably…
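For reference, the traditional metric the abstract contrasts against can be sketched in a few lines. This is the standard two-rater Cohen's κ, computed as (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is agreement expected by chance. It is not the xRR framework, and the two toy raters below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Standard Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    categories = set(rater_a) | set(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels: 1 = hate speech, 0 = not hate speech.
rater_a = [1, 0, 1, 1, 0, 0, 1, 0]
rater_b = [1, 0, 0, 1, 0, 1, 1, 0]

kappa = cohens_kappa(rater_a, rater_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

The result is a single number per rater pair, so every mismatch counts the same way regardless of whether it reflects a mistake or a legitimate difference in perspective. That collapse is the simplification the abstract objects to.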
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection
