Contextualizing Hate Speech Classifiers with Post-hoc Explanation
Brendan Kennedy, Xisen Jin, Aida Mostafazadeh Davani, Morteza Dehghani, Xiang Ren

TL;DR
This paper uses post-hoc explanations to detect bias toward group identifiers in hate speech classifiers, and introduces an explanation-based regularization technique that pushes models to rely on the context around those identifiers. The result is fewer false positives triggered by the identifiers alone.
Contribution
It presents a novel regularization method for BERT classifiers, based on Sampling and Occlusion (SOC) explanations, that helps the model incorporate context and reduces bias toward group identifiers in hate speech detection.
Findings
Reduced false positives on out-of-domain data.
Maintained or improved in-domain performance.
Enhanced model understanding of context in hate speech detection.
Abstract
Hate speech classifiers trained on imbalanced datasets struggle to determine if group identifiers like "gay" or "black" are used in offensive or prejudiced ways. Such biases manifest in false positives when these identifiers are present, due to models' inability to learn the contexts which constitute a hateful usage of identifiers. We extract SOC post-hoc explanations from fine-tuned BERT classifiers to efficiently detect bias towards identity terms. Then, we propose a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves. Our approach improved over baselines in limiting false positives on out-of-domain data while maintaining or improving in-domain performance. Project page: https://inklab.usc.edu/contextualize-hate-speech/.
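The regularization idea in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation. It assumes a toy bag-of-words logistic scorer with hand-set weights (`WEIGHTS`), an illustrative identifier list (`IDENTITY_TERMS`), and plain single-token occlusion as a stand-in for SOC. All names and values are hypothetical. The penalty adds the squared importance of each group identifier to the classification loss, so a model is discouraged from deciding on the identifier alone.

```python
import math

# Hypothetical toy model: bag-of-words logistic scorer with hand-set weights.
# These values are illustrative only and do not come from the paper.
WEIGHTS = {"gay": 2.0, "hate": 1.5, "proud": -0.5, "i": 0.0, "am": 0.0}
IDENTITY_TERMS = {"gay", "black"}  # illustrative subset of group identifiers


def logit(tokens, weights):
    """Pre-sigmoid score for the 'hate' class; unknown tokens score 0."""
    return sum(weights.get(t, 0.0) for t in tokens)


def occlusion_importance(tokens, index, weights, pad="<pad>"):
    """Score change when the token at `index` is replaced by a padding token.

    Plain occlusion is used here as a simplified stand-in for SOC.
    """
    occluded = tokens[:index] + [pad] + tokens[index + 1:]
    return logit(tokens, weights) - logit(occluded, weights)


def explanation_penalty(tokens, weights, identity_terms=IDENTITY_TERMS):
    """Sum of squared importances of the identity terms present in the input."""
    return sum(
        occlusion_importance(tokens, i, weights) ** 2
        for i, t in enumerate(tokens)
        if t in identity_terms
    )


def regularized_loss(tokens, label, weights, alpha=0.1):
    """Binary cross-entropy plus alpha times the explanation penalty."""
    p = 1.0 / (1.0 + math.exp(-logit(tokens, weights)))
    cross_entropy = -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return cross_entropy + alpha * explanation_penalty(tokens, weights)
```

For a non-hateful input such as `["i", "am", "gay", "proud"]` with label `0`, the penalty term grows with the weight the toy model places on the identifier. Minimizing the combined loss would therefore shift the model's reliance toward the surrounding words. The strength `alpha` is an assumed hyperparameter in this sketch.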
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning
Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax
