What Changed? Investigating Debiasing Methods using Causal Mediation Analysis
Sullam Jeoung, Jana Diesner

TL;DR
This paper uses causal mediation analysis to explore how debiasing methods impact internal model components and downstream toxicity detection, revealing the importance of specific layers and attention heads in gender bias mitigation.
Contribution
The paper applies causal mediation analysis to examine how debiasing techniques change the internal components of language models, focusing on gender bias and on toxicity detection as the downstream task.
Findings
Debiasing effects vary across different bias metrics.
Certain layers and attention heads are more affected by debiasing.
Testing with multiple bias metrics is essential for evaluating debiasing effectiveness.
Abstract
Previous work has examined how debiasing language models affects downstream tasks, specifically how debiasing techniques influence task performance and whether debiased models also make impartial predictions in downstream tasks. However, it is not yet well understood why debiasing methods have varying impacts on downstream tasks, or how debiasing techniques affect internal components of language models, i.e., neurons, layers, and attention heads. In this paper, we decompose the internal mechanisms of debiasing language models with respect to gender by applying causal mediation analysis to understand the influence of debiasing methods on toxicity detection as a downstream task. Our findings suggest a need to test the effectiveness of debiasing methods with different bias metrics, and to focus on changes in the behavior of certain components of the models, e.g., first two layers…
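As a rough illustration of what causal mediation analysis measures, the sketch below decomposes the effect of an input change into a direct part and an indirect part that flows through a single mediator. It is a toy linear model, not the authors' setup: `ToyModel`, its coefficients, and `mediation_effects` are hypothetical names chosen for this example, and the mediator stands in for an internal component such as a layer or attention head.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ToyModel:
    """Hypothetical model: mediator m = a*x, output y = b*m + c*x."""
    a: float = 2.0
    b: float = 3.0
    c: float = 0.5

    def mediator(self, x: float) -> float:
        # Stand-in for an internal component (e.g. one layer's activation).
        return self.a * x

    def output(self, x: float, m_override: Optional[float] = None) -> float:
        # If m_override is given, the mediator is forced to that value
        # instead of being computed from x (the "intervention").
        m = self.mediator(x) if m_override is None else m_override
        return self.b * m + self.c * x


def mediation_effects(model: ToyModel, x_base: float, x_alt: float) -> Dict[str, float]:
    """Total, direct, and indirect effects of changing x_base to x_alt."""
    y_base = model.output(x_base)
    m_base = model.mediator(x_base)
    m_alt = model.mediator(x_alt)

    # Total effect: change the input and let the mediator respond.
    total = model.output(x_alt) - y_base
    # Direct effect: change the input but hold the mediator at its base value.
    direct = model.output(x_alt, m_override=m_base) - y_base
    # Indirect effect: keep the input, set the mediator to its altered value.
    indirect = model.output(x_base, m_override=m_alt) - y_base
    return {"total": total, "direct": direct, "indirect": indirect}


effects = mediation_effects(ToyModel(), x_base=0.0, x_alt=1.0)
print(effects)
```

In this linear toy case the total effect is exactly the sum of the direct and indirect effects; in a real language model the decomposition is generally not additive, and the effects would be computed on model outputs rather than on a scalar.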
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Explainable Artificial Intelligence (XAI) · Software Engineering Research
