Contextualizing Hate Speech Classifiers with Post-hoc Explanation

Brendan Kennedy; Xisen Jin; Aida Mostafazadeh Davani; Morteza Dehghani; Xiang Ren

arXiv:2005.02439 · cs.CL · July 8, 2020 · 20 cites

TL;DR

This paper extracts SOC post-hoc explanations from fine-tuned BERT hate speech classifiers to detect bias toward group identifiers, then proposes an explanation-based regularization technique that encourages models to learn from the context around those identifiers. The result is fewer identifier-related false positives.

Contribution

A novel regularization technique, based on SOC post-hoc explanations of fine-tuned BERT classifiers, that encourages models to learn from the context of group identifiers in addition to the identifiers themselves.

Findings

01. Reduced false positives on out-of-domain data.

02. Maintained or improved in-domain performance.

03. Encouraged models to learn from the context of group identifiers in hate speech detection.

Abstract

Hate speech classifiers trained on imbalanced datasets struggle to determine if group identifiers like "gay" or "black" are used in offensive or prejudiced ways. Such biases manifest in false positives when these identifiers are present, due to models' inability to learn the contexts which constitute a hateful usage of identifiers. We extract SOC post-hoc explanations from fine-tuned BERT classifiers to efficiently detect bias towards identity terms. Then, we propose a novel regularization technique based on these explanations that encourages models to learn from the context of group identifiers in addition to the identifiers themselves. Our approach improved over baselines in limiting false positives on out-of-domain data while maintaining or improving in-domain performance. Project page: https://inklab.usc.edu/contextualize-hate-speech/.
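
The regularization idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it swaps SOC for plain single-token occlusion (no context sampling), replaces BERT with a toy bag-of-words scorer, and assumes a penalty of the form alpha times the sum of squared importances of identifier tokens. The weights, the identifier list, the value of alpha, and all function names are hypothetical.

```python
import math

# Hypothetical toy weights for a bag-of-words scorer; not taken from the paper.
WEIGHTS = {"gay": 2.0, "black": 1.5, "hate": 2.5, "proud": -1.0}
BIAS = -1.0
IDENTIFIERS = {"gay", "black"}  # assumed list of group identifiers


def logit(tokens):
    """Unnormalized 'hateful' score of a token list."""
    return BIAS + sum(WEIGHTS.get(t, 0.0) for t in tokens)


def occlusion_importance(tokens, index):
    """Score change when the token at `index` is replaced by padding.

    Plain occlusion, used here as a stand-in for SOC, which additionally
    samples the surrounding context.
    """
    occluded = tokens[:index] + ["<pad>"] + tokens[index + 1:]
    return logit(tokens) - logit(occluded)


def explanation_penalty(tokens, alpha=0.1):
    """alpha * sum of squared importances of identifier tokens in the input."""
    return alpha * sum(
        occlusion_importance(tokens, i) ** 2
        for i, t in enumerate(tokens)
        if t in IDENTIFIERS
    )


def regularized_loss(tokens, label, alpha=0.1):
    """Binary cross-entropy plus the explanation penalty (assumed form)."""
    p = 1.0 / (1.0 + math.exp(-logit(tokens)))
    bce = -(label * math.log(p) + (1 - label) * math.log(1.0 - p))
    return bce + explanation_penalty(tokens, alpha)
```

In this sketch, a non-hateful input such as ["i", "am", "proud", "and", "gay"] incurs a penalty because the scorer leans on the identifier alone. Minimizing the combined loss would push that token's importance toward zero, so the remaining evidence has to come from context words.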

Taxonomy

Topics: Hate Speech and Cyberbullying Detection · Adversarial Robustness in Machine Learning

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax