Take its Essence, Discard its Dross! Debiasing for Toxic Language Detection via Counterfactual Causal Effect
Junyu Lu, Bo Xu, Xiaokun Zhang, Kaiyuan Liu, Dongyu Zhang, Liang Yang, Hongfei Lin

TL;DR
This paper introduces a novel causal inference framework to effectively distinguish and remove misleading lexical biases in toxic language detection, improving accuracy and fairness.
Contribution
It proposes a Counterfactual Causal Debiasing Framework that preserves useful lexical effects while eliminating misleading biases, enhancing model performance.
Findings
Achieves state-of-the-art accuracy and fairness in toxic language detection.
Outperforms existing debiasing methods on out-of-distribution data.
Demonstrates effective separation of useful and misleading lexical biases.
Abstract
Current methods of toxic language detection (TLD) typically rely on specific tokens to make decisions, which makes them suffer from lexical bias, leading to inferior performance and generalization. Lexical bias has both "useful" and "misleading" impacts on understanding toxicity. Unfortunately, instead of distinguishing between these impacts, current debiasing methods typically eliminate them indiscriminately, which degrades the model's detection accuracy. To this end, we propose a Counterfactual Causal Debiasing Framework (CCDF) to mitigate lexical bias in TLD. It preserves the "useful impact" of lexical bias and eliminates the "misleading impact". Specifically, we first represent the total effect of the original sentence and biased tokens on decisions from a causal view. We then conduct counterfactual inference to exclude the direct causal effect of lexical bias…
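The abstract is cut off before the method is fully specified, so the following is only a minimal sketch of the general idea it names: take a prediction that reflects the total effect of the sentence and its biased tokens, then subtract the direct effect attributed to the biased tokens alone. The function names, the two-class logit layout, the example numbers, and the scaling factor `alpha` are all assumptions made for illustration; they are not taken from the paper and may differ from how CCDF actually computes these effects.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def debiased_logits(fused_logits, bias_only_logits, alpha=1.0):
    """Subtract the bias-only branch from the fused prediction.

    fused_logits:     hypothetical output using the sentence plus biased tokens
                      (stands in for the total effect).
    bias_only_logits: hypothetical output of a branch that sees only the
                      biased tokens (stands in for the direct effect).
    alpha:            hypothetical scaling factor, not from the paper.
    """
    return [f - alpha * b for f, b in zip(fused_logits, bias_only_logits)]


# Made-up logits in the order [non-toxic, toxic] for a benign sentence
# that happens to contain a token often seen in toxic text.
fused = [0.2, 1.4]
bias_only = [-0.5, 1.5]

adjusted = debiased_logits(fused, bias_only)
probs_before = softmax(fused)
probs_after = softmax(adjusted)

print("toxic probability before:", round(probs_before[1], 3))
print("toxic probability after: ", round(probs_after[1], 3))
```

In this toy case the bias-only branch pushes strongly toward "toxic", so removing its contribution flips the prediction to "non-toxic". How the framework keeps the "useful impact" of lexical bias while doing this is not recoverable from the truncated abstract, and the sketch does not attempt to model it.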
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Deception detection and forensic psychology
