Measuring Mechanistic Independence: Can Bias Be Removed Without Erasing Demographics?

Zhengyang Shan; Aaron Mueller

arXiv:2512.20796 · cs.CL · December 25, 2025

TL;DR

This paper explores whether language models can be debiased to remove demographic biases without losing their ability to recognize demographic features, using targeted interventions that preserve core capabilities.

Contribution

The paper introduces a multi-task evaluation setup in which demographics are associated with names, professions, and education levels, and it compares attribution-based and correlation-based methods for locating the features responsible for bias.
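To make the two families of feature-location methods concrete, the following is a minimal sketch, not the paper's implementation. It assumes correlation-based scoring means the absolute Pearson correlation between a feature's activation and a binary demographic label, and that attribution-based scoring means the common activation-times-gradient estimate of a feature's effect on a bias metric. The function names, array shapes, and the toy data are all hypothetical.

```python
import numpy as np

def correlation_scores(feature_acts, labels):
    """Absolute Pearson correlation of each feature with a binary label.

    feature_acts: (n_examples, n_features) feature activations (hypothetical).
    labels:       (n_examples,) 0/1 demographic labels (hypothetical).
    """
    acts = feature_acts - feature_acts.mean(axis=0)
    labs = labels - labels.mean()
    cov = acts.T @ labs
    denom = np.linalg.norm(acts, axis=0) * np.linalg.norm(labs)
    # Constant features have zero variance; give them a score of 0.
    return np.abs(cov / np.where(denom == 0, 1.0, denom))

def attribution_scores(feature_acts, grads):
    """Activation-times-gradient attribution, averaged over examples.

    grads: (n_examples, n_features) gradient of a bias metric with respect
    to each feature activation (assumed to be computed elsewhere).
    """
    return np.abs((feature_acts * grads).mean(axis=0))

def top_k(scores, k):
    """Indices of the k highest-scoring features."""
    return np.argsort(-scores)[:k]

# Toy data: feature 0 tracks the label, feature 1 is constant but has a
# large gradient, feature 2 is constant with no gradient.
labels = np.array([0, 1, 0, 1])
acts = np.array([[0, 1, 5], [1, 1, 5], [0, 1, 5], [1, 1, 5]], dtype=float)
grads = np.tile(np.array([0.0, 2.0, 0.0]), (4, 1))

corr = correlation_scores(acts, labels)
attr = attribution_scores(acts, grads)
```

In this toy setup the two rankings disagree: the correlation score favors the feature that co-varies with the label, while the attribution score favors the feature the metric is sensitive to. A disagreement of this kind is what makes a comparison between the two method families informative.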

Findings

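01 Targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance.

02 Attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy.

03 Correlation-based ablations are more effective for education bias, where removing attribution features induces "prior collapse" and increases overall bias.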

Abstract

We investigate how independent demographic bias mechanisms are from general demographic recognition in language models. Using a multi-task evaluation setup where demographics are associated with names, professions, and education levels, we measure whether models can be debiased while preserving demographic detection capabilities. We compare attribution-based and correlation-based methods for locating bias features. We find that targeted sparse autoencoder feature ablations in Gemma-2-9B reduce bias without degrading recognition performance: attribution-based ablations mitigate race and gender profession stereotypes while preserving name recognition accuracy, whereas correlation-based ablations are more effective for education bias. Qualitative analysis further reveals that removing attribution features in education tasks induces "prior collapse", thus increasing overall bias. This…
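The feature ablation described in the abstract can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the paper: the sparse autoencoder uses a plain ReLU encoder (real SAEs may use other activations), the SAE's reconstruction error is added back so that only the selected features change, and every name and shape (`ablate_sae_features`, `W_enc`, `W_dec`, and so on) is hypothetical.

```python
import numpy as np

def ablate_sae_features(x, W_enc, b_enc, W_dec, b_dec, ablate_idx):
    """Zero selected SAE features in one activation vector.

    x:          (d_model,) model activation at the hooked layer.
    W_enc:      (d_model, n_features) encoder weights.
    W_dec:      (n_features, d_model) decoder weights.
    ablate_idx: indices of the features to zero out.
    """
    idx = np.asarray(ablate_idx, dtype=int)
    # Encode with a ReLU (an assumption made for this sketch).
    f = np.maximum(0.0, x @ W_enc + b_enc)
    # Keep whatever the SAE fails to reconstruct, so the intervention
    # changes only the contribution of the ablated features.
    error = x - (f @ W_dec + b_dec)
    f_ablated = f.copy()
    f_ablated[idx] = 0.0
    return f_ablated @ W_dec + b_dec + error

# Toy SAE with identity weights: each feature is one activation dimension.
W = np.eye(2)
b = np.zeros(2)
x = np.array([1.0, 2.0])

unchanged = ablate_sae_features(x, W, b, W, b, [])
ablated = ablate_sae_features(x, W, b, W, b, [0])
```

With no features selected the function returns the input unchanged, which is a useful sanity check before measuring how an ablation shifts bias and recognition metrics.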

Taxonomy

Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Topic Modeling