Can LLMs Evaluate What They Cannot Annotate? Revisiting LLM Reliability in Hate Speech Detection

Paloma Piot; David Otero; Patricia Martín-Rodilla; Javier Parapar

arXiv:2512.09662 · cs.CL · December 11, 2025

TL;DR

This paper investigates how reliable Large Language Models (LLMs) are as annotators for hate speech detection. Their labels diverge from human labels at the instance level, but they reliably reflect relative performance trends across models, which makes them usable as scalable proxies for evaluation.

Contribution

The study applies a subjectivity-aware framework, cross-Rater Reliability (xRR), to assess LLM reliability and demonstrates the potential of LLMs as proxy evaluators for ranking model performance.

Findings

01

LLMs diverge from humans at the instance level in hate speech detection.

02

LLMs can reliably reflect relative performance trends across models.

03

LLMs may serve as scalable proxies for subjective NLP task evaluation.

Abstract

Hate speech spreads widely online, harming individuals and communities. Automatic detection is therefore essential for large-scale moderation, yet it remains difficult. Part of the challenge lies in subjectivity: what one person flags as hate speech, another may see as benign. Traditional annotation agreement metrics, such as Cohen's $\kappa$, oversimplify this disagreement, treating it as an error rather than meaningful diversity. Meanwhile, Large Language Models (LLMs) promise scalable annotation, but prior studies demonstrate that they cannot fully replace human judgement, especially in subjective tasks. In this work, we reexamine LLM reliability using a subjectivity-aware framework, cross-Rater Reliability (xRR), revealing that even under this fairer lens, LLMs still diverge from humans. Yet this limitation opens an opportunity: we find that LLM-generated annotations can reliably…
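To make the abstract's point about traditional agreement metrics concrete, here is a minimal sketch of the standard two-rater Cohen's $\kappa$. It is not the paper's xRR framework, and the function name `cohens_kappa` and the label lists are hypothetical, invented purely for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Standard two-rater Cohen's kappa over the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels (1 = hate speech, 0 = not), not taken from the paper.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 0, 1, 0, 1, 1, 0]
kappa = cohens_kappa(human, llm)
print(f"kappa = {kappa:.2f}")
```

Each rater contributes exactly one label per item, so any mismatch lowers the score. The metric has no way to represent a pool of human annotators who legitimately disagree among themselves, which is the simplification the abstract objects to.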

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Sentiment Analysis and Opinion Mining · Spam and Phishing Detection