Does Debiasing Inevitably Degrade the Model Performance
Yiran Liu, Xiao Liu, Haotian Chen, Yang Yu

TL;DR
This paper presents a theoretical framework for understanding gender bias in language models, explains why debiasing often degrades performance, and introduces a causality-based fine-tuning method that partially mitigates bias while avoiding performance degradation.
Contribution
The authors develop a theoretical explanation for bias mechanisms, identify when debiasing does not harm performance, and propose a causality-driven fine-tuning approach.
Findings
The theoretical framework explains three candidate mechanisms of gender bias in language models.
There is a pathway through which debiasing does not degrade model performance.
Causality-detection fine-tuning partially mitigates gender bias while avoiding performance degradation.
Abstract
Gender bias in language models has attracted considerable attention because it threatens social justice. However, most current debiasing methods degrade the model's performance on other tasks, and the mechanism behind this degradation remains unclear. We propose a theoretical framework that explains three candidate mechanisms of gender bias in language models. We use this framework to explain why current debiasing methods cause performance degradation. We also identify a pathway through which debiasing does not degrade model performance. We further develop a causality-detection fine-tuning approach to correct gender bias. The numerical experiment demonstrates that our method yields a double dividend: it partially mitigates gender bias while avoiding performance degradation.
Taxonomy
Topics: Topic Modeling · Explainable Artificial Intelligence (XAI) · Adversarial Robustness in Machine Learning
