Debiasing Methods in Natural Language Understanding Make Bias More Accessible
Michael Mendelson, Yonatan Belinkov

TL;DR
This paper introduces a probing framework to interpret biases in language models and finds that debiasing efforts may inadvertently increase bias encoding within model representations.
Contribution
It presents a novel information-theoretic probing method to analyze biases in language models and reveals that debiasing can make biases more accessible in internal representations.
Findings
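The more strongly a model is pushed toward a debiased regime, the more extractable the bias becomes from its inner representations.
The paper proposes a general probing-based framework for post-hoc interpretation of biases in language models.
The result challenges the assumption that debiasing methods lead models to discover more robust features in their representations.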
Abstract
Model robustness to bias is often determined by the generalization on carefully designed out-of-distribution datasets. Recent debiasing methods in natural language understanding (NLU) improve performance on such datasets by pressuring models into making unbiased predictions. An underlying assumption behind such methods is that this also leads to the discovery of more robust features in the model's inner representations. We propose a general probing-based framework that allows for post-hoc interpretation of biases in language models, and use an information-theoretic approach to measure the extractability of certain biases from the model's representations. We experiment with several NLU datasets and known biases, and show that, counter-intuitively, the more a language model is pushed towards a debiased regime, the more bias is actually encoded in its inner representations.
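To make "extractability" concrete, here is a minimal sketch of one information-theoretic probing measure: the online (prequential) codelength of bias labels given frozen representations. The abstract does not say which measure the authors use, so the choice of online coding is an assumption. The logistic-regression probe, the block fractions, and the synthetic one-dimensional "representations" are also illustrative and not taken from the paper. A shorter codelength, and so a higher compression ratio, means the probe can recover the bias label more easily.

```python
import math
import random


def _sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def _score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_probe(xs, ys, steps=200, lr=0.5):
    """Fit a logistic-regression probe with plain batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(steps):
        gw, gb = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = _sigmoid(_score(w, b, x)) - y
            for j in range(dim):
                gw[j] += err * x[j]
            gb += err
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return w, b


def online_codelength(xs, ys, fractions=(0.1, 0.2, 0.4, 0.8, 1.0)):
    """Bits needed to transmit binary labels ys block by block.

    The first block is sent with a uniform code (1 bit per binary label).
    Each later block is encoded with a probe trained on all earlier blocks.
    """
    n = len(xs)
    cuts = [max(1, int(f * n)) for f in fractions]
    bits = float(cuts[0])
    for start, end in zip(cuts, cuts[1:]):
        w, b = train_probe(xs[:start], ys[:start])
        for x, y in zip(xs[start:end], ys[start:end]):
            p1 = _sigmoid(_score(w, b, x))
            p = p1 if y == 1 else 1.0 - p1
            bits -= math.log2(max(p, 1e-12))  # clamp to avoid log(0)
    return bits


def compression(xs, ys):
    """Uniform codelength (n bits for binary labels) over online codelength."""
    return len(xs) / online_codelength(xs, ys)


# Synthetic stand-ins for model representations (hypothetical data):
# `encoded` carries the bias label in its single feature, `unrelated` does not.
rng = random.Random(0)
bias_labels = [rng.randint(0, 1) for _ in range(200)]
encoded = [[(2 * y - 1) + rng.gauss(0, 0.5)] for y in bias_labels]
unrelated = [[rng.gauss(0, 1)] for _ in bias_labels]

print("compression, bias encoded:  ", round(compression(encoded, bias_labels), 2))
print("compression, bias unrelated:", round(compression(unrelated, bias_labels), 2))
```

In a real study along these lines, the feature vectors would be representations from a baseline model and a debiased model, and the labels would mark whether a known dataset bias is present in each example. Comparing the two compression ratios then shows which model makes the bias easier to extract.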
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
