Evaluating Simple Debiasing Techniques in RoBERTa-based Hate Speech Detection Models
Diana Iftimie, Erik Zinn

TL;DR
This paper evaluates simple debiasing techniques applied to RoBERTa-based hate speech detection models. Their effectiveness depends on how the training dataset is constructed; with careful construction, they can reduce dialect-based disparities.
Contribution
It systematically assesses the impact of debiasing techniques on dialect bias in hate speech detection models using RoBERTa.
Findings
Debiasing effectiveness varies with dataset construction methods.
Accounting for representation bias in the training data improves disparity reduction.
Simple techniques can mitigate dialect bias with careful dataset design.
Abstract
The hate speech detection task is known to suffer from bias against African American English (AAE) dialect text, due to the annotation bias present in the underlying hate speech datasets used to train these models. This leads to a disparity where normal AAE text is more likely to be misclassified as abusive/hateful compared to non-AAE text. Simple debiasing techniques have been developed in the past to counter this sort of disparity, and in this work, we apply and evaluate these techniques in the scope of RoBERTa-based encoders. Experimental results suggest that the success of these techniques depends heavily on the methods used for training dataset construction, but with proper consideration of representation bias, they can reduce the disparity seen among dialect subgroups on the hate speech detection task.
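The disparity described in the abstract, where benign AAE text is flagged as abusive more often than benign non-AAE text, can be illustrated as a gap in per-group false positive rates. The sketch below is a minimal illustration under stated assumptions: the function name `false_positive_rates`, the group names, and the toy labels are all hypothetical, and the source does not say which disparity metric the paper actually reports.

```python
from collections import defaultdict


def false_positive_rates(labels, preds, groups):
    """Return, per dialect group, the share of benign texts (label 0)
    that the classifier predicted as abusive (prediction 1)."""
    benign = defaultdict(int)   # count of benign texts per group
    flagged = defaultdict(int)  # benign texts wrongly predicted abusive
    for y, p, g in zip(labels, preds, groups):
        if y == 0:
            benign[g] += 1
            if p == 1:
                flagged[g] += 1
    return {g: flagged[g] / benign[g] for g in benign}


# Toy data for illustration only (not results from the paper).
labels = [0, 0, 0, 0, 1, 0, 0, 0, 0, 1]   # 1 = abusive, 0 = benign
preds = [1, 1, 0, 0, 1, 1, 0, 0, 0, 1]    # hypothetical model predictions
groups = ["aae"] * 5 + ["non_aae"] * 5    # hypothetical dialect labels

rates = false_positive_rates(labels, preds, groups)
gap = rates["aae"] - rates["non_aae"]     # positive gap = bias against AAE
print(rates, gap)
```

Under this framing, a debiasing technique would count as helpful if it shrinks the gap without substantially degrading overall classification quality.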
Taxonomy
Topics: Hate Speech and Cyberbullying Detection
