Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?
Zhengyang Shan, Aaron Mueller

TL;DR
This paper explores whether language models can be debiased without losing their ability to recognize demographic features, using targeted interventions that preserve core capabilities.
Contribution
The paper introduces a multi-task evaluation framework and compares attribution-based and correlation-based methods for locating bias features, showing that targeted ablation of those features can mitigate bias.
Findings
Targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance.
Attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy.
Correlation-based ablations are more effective for education bias.
Abstract
We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This…
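The abstract contrasts two ways of locating features and then ablating them. The following is a minimal sketch of that general idea on a toy sparse autoencoder, not the authors' implementation. All function names, the ReLU encoder, the use of absolute Pearson correlation, the activation-times-gradient attribution rule, and the linear readout metric are assumptions made for illustration; the paper's exact definitions may differ.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Toy SAE encoder (assumed form): features = relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def correlation_scores(feats, labels):
    """Absolute Pearson correlation of each feature with a demographic label."""
    f = feats - feats.mean(axis=0)
    y = labels - labels.mean()
    denom = np.sqrt((f ** 2).sum(axis=0) * (y ** 2).sum()) + 1e-12
    return np.abs(f.T @ y) / denom

def attribution_scores(feats, W_dec, readout):
    """Mean |activation * gradient| per feature.

    Assumes the bias metric is a linear readout of the decoded activation,
    so the gradient with respect to each feature is W_dec @ readout.
    """
    grad = W_dec @ readout  # shape: (n_features,)
    return np.abs(feats * grad).mean(axis=0)

def ablate_features(x, W_enc, b_enc, W_dec, idx):
    """Zero-ablate the chosen features by subtracting their decoded
    contribution from x, leaving the rest of the activation untouched."""
    idx = np.asarray(idx, dtype=int)
    f = sae_encode(x, W_enc, b_enc)
    return x - f[:, idx] @ W_dec[idx]

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return np.argsort(scores)[::-1][:k]

# Toy usage with random weights (hypothetical sizes, not Gemma-2-9B).
rng = np.random.default_rng(0)
d_model, n_feat, n = 8, 16, 200
W_enc = rng.normal(size=(d_model, n_feat))
W_dec = rng.normal(size=(n_feat, d_model))
b_enc = np.zeros(n_feat)
x = rng.normal(size=(n, d_model))
labels = rng.integers(0, 2, size=n).astype(float)  # stand-in demographic label
readout = rng.normal(size=d_model)                 # stand-in bias metric direction

feats = sae_encode(x, W_enc, b_enc)
corr_idx = top_k(correlation_scores(feats, labels), k=3)
attr_idx = top_k(attribution_scores(feats, W_dec, readout), k=3)
x_corr_ablated = ablate_features(x, W_enc, b_enc, W_dec, corr_idx)
x_attr_ablated = ablate_features(x, W_enc, b_enc, W_dec, attr_idx)
```

In this sketch the two scoring rules can select different feature sets, which is the kind of difference the abstract reports between tasks. Evaluating the effect would require rerunning the bias and recognition tasks on the ablated activations, which is outside the scope of this toy example.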
Taxonomy
Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Topic Modeling
