Diagnosing Hate Speech Classification: Where Do Humans and Machines Disagree, and Why?
Xilin Yang

TL;DR
This paper investigates discrepancies between human and machine hate speech classification using cosine similarity, embedding regression, and a large language model, revealing insights into annotation inconsistencies and model performance differences.
Contribution
It introduces diagnostic methods to analyze human-machine disagreement in hate speech detection and highlights the impact of model alignment on classification accuracy.
Findings
Humans are more sensitive to racial slurs targeting Black populations.
Machines outperform humans on long factual statements but struggle with short swear words.
Model alignment affects the ability to detect obvious hate speech.
Abstract
This study uses the cosine similarity ratio, embedding regression, and manual re-annotation to diagnose hate speech classification. We begin by computing the cosine similarity ratio on the "Measuring Hate Speech" dataset, which contains 135,556 annotated social media comments, showing a basic use of cosine similarity as a description of hate speech content. We then diagnose hate speech classification, starting with the inconsistency of human annotation in the dataset. Using embedding regression as a basic diagnostic, we find that female annotators are more sensitive to racial slurs that target the Black population. We then perform a more complex diagnostic by training a hate speech classifier: a state-of-the-art pre-trained large language model, NV-Embed-v2, converts texts to embeddings, on which we run a logistic regression. This classifier achieves a testing accuracy of…
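The embed-then-classify diagnostic described in the abstract can be illustrated with a minimal sketch. This is not the paper's code. It assumes the texts have already been converted to fixed-length embedding vectors, and it substitutes synthetic Gaussian vectors (the hypothetical helper `make_toy_embeddings`) for real NV-Embed-v2 output. The logistic regression is written in plain Python so the example runs without dependencies; in practice a library implementation would be the natural choice.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logreg(X, y, lr=0.5, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, X):
    """Label an item 1 (hate speech) when the predicted probability is >= 0.5."""
    return [
        1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0
        for xi in X
    ]


def make_toy_embeddings(n, dim=8, seed=0):
    """Hypothetical stand-in for real text embeddings: two Gaussian clusters."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate classes so both are equally represented
        center = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(center, 1.0) for _ in range(dim)])
        y.append(label)
    return X, y


# "Human" labels come with the toy data; the classifier plays the machine.
X_train, y_train = make_toy_embeddings(200, seed=0)
X_test, y_test = make_toy_embeddings(100, seed=1)

w, b = train_logreg(X_train, y_train)
y_pred = predict(w, b, X_test)

accuracy = sum(p == t for p, t in zip(y_pred, y_test)) / len(y_test)
# Indices where the machine prediction and the human label disagree;
# these are the items one would pull out for manual re-annotation.
disagreements = [i for i, (p, t) in enumerate(zip(y_pred, y_test)) if p != t]

print(f"test accuracy: {accuracy:.3f}")
print(f"{len(disagreements)} items where model and label disagree")
```

The `disagreements` list is the useful part for diagnosis: it isolates the comments on which the classifier and the annotators part ways, and those can then be inspected or re-annotated by hand.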
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
