TL;DR
This paper analyzes how bias mitigation alters the internal representations of foundation models such as BERT and Llama2, showing that reduced gender-occupation bias appears as geometric changes in the embedding space.
Contribution
It provides an internal representational analysis of bias mitigation effects and introduces WinoDec, a new dataset for assessing decoder-only models.
Findings
Bias mitigation reduces gender-occupation disparities in embeddings.
Representational shifts are consistent across the encoder-only (BERT) and decoder-only (Llama2) models studied.
Embedding analysis can validate debiasing effectiveness.
Abstract
We investigate how successful bias mitigation reshapes the embedding space of encoder-only and decoder-only foundation models, offering an internal audit of model behaviour through representational analysis. Using BERT and Llama2 as representative architectures, we assess the shifts in associations between gender and occupation terms by comparing baseline and bias-mitigated variants of the models. Our findings show that bias mitigation reduces gender-occupation disparities in the embedding space, leading to more neutral and balanced internal representations. These representational shifts are consistent across both model types, suggesting that fairness improvements can manifest as interpretable and geometric transformations. These results position embedding analysis as a valuable tool for understanding and validating the effectiveness of debiasing methods in foundation models. To further…
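The comparison of gender-occupation associations between a baseline and a bias-mitigated model can be illustrated with a minimal sketch. This is not the paper's method or metric: the association score (mean cosine similarity to male terms minus mean cosine similarity to female terms), the helper names `gender_association` and `mean_disparity`, and the two-dimensional toy vectors are all assumptions made for illustration. In practice the vectors would be extracted from the two model variants.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def gender_association(occupation_vec, male_vecs, female_vecs):
    """Hypothetical score: positive leans male, negative leans female."""
    male_sim = sum(cosine(occupation_vec, m) for m in male_vecs) / len(male_vecs)
    female_sim = sum(cosine(occupation_vec, f) for f in female_vecs) / len(female_vecs)
    return male_sim - female_sim


def mean_disparity(occupations, male_vecs, female_vecs):
    """Average absolute association over all occupation vectors."""
    scores = [
        abs(gender_association(vec, male_vecs, female_vecs))
        for vec in occupations.values()
    ]
    return sum(scores) / len(scores)


# Toy 2-D vectors, invented for illustration only (not taken from the paper).
MALE = [[1.0, 0.0]]
FEMALE = [[0.0, 1.0]]
BASELINE = {"engineer": [0.9, 0.1], "nurse": [0.2, 0.8]}
MITIGATED = {"engineer": [0.6, 0.5], "nurse": [0.5, 0.6]}

baseline_gap = mean_disparity(BASELINE, MALE, FEMALE)
mitigated_gap = mean_disparity(MITIGATED, MALE, FEMALE)

print(f"baseline disparity:  {baseline_gap:.3f}")
print(f"mitigated disparity: {mitigated_gap:.3f}")
```

Under these assumptions, a smaller mean disparity for the mitigated vectors corresponds to occupation terms sitting closer to equidistant from the male and female terms, which is the kind of geometric shift the abstract describes.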
