Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs

Xiangchen Song; Aashiq Muhamed; Yujia Zheng; Lingjing Kong; Zeyu Tang; Mona T. Diab; Virginia Smith; Kun Zhang

arXiv:2505.20254 · cs.LG · May 27, 2025

TL;DR

This paper advocates for prioritizing feature consistency in Sparse Autoencoders to improve the reliability of mechanistic interpretability, introducing a new metric and validating its effectiveness across synthetic and real-world data.

Contribution

It introduces PW-MCC as a practical metric for feature consistency, provides theoretical and empirical validation, and emphasizes the importance of consistency for progress in MI.

Findings

01 TopK SAEs trained on LLM activations reach a PW-MCC of 0.80 when architectural choices are appropriate.

02 Feature consistency correlates with the semantic similarity of feature explanations.

03 Synthetic validation confirms PW-MCC as a reliable proxy.

Abstract

Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing…
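
As a rough illustration of the kind of quantity PW-MCC measures, the sketch below matches the features of two dictionaries one-to-one and averages the similarity of the matched pairs, then averages that score over all pairs of runs. This is not taken from the paper or its repository: the function names, the use of cosine similarity between dictionary rows, the absolute value, and the Hungarian matching via SciPy are all assumptions made for the example.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linear_sum_assignment


def pairwise_mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean similarity of optimally matched features between two dictionaries.

    dict_a, dict_b: arrays of shape (n_features, d_model), one feature per row.
    Assumption: similarity is the absolute cosine similarity between rows.
    """
    # Normalize each feature vector to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    # Entry (i, j) is |cos| between feature i of run A and feature j of run B.
    sim = np.abs(a @ b.T)
    # One-to-one matching that maximizes total similarity (Hungarian algorithm).
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def pw_mcc(dictionaries: list) -> float:
    """Average pairwise_mcc over every unordered pair of training runs."""
    scores = [pairwise_mcc(a, b) for a, b in combinations(dictionaries, 2)]
    return float(np.mean(scores))
```

Under these assumptions, a dictionary compared with a permuted or rescaled copy of itself scores 1.0, while unrelated random dictionaries score well below that, which matches the intuition that the score reflects convergence to equivalent feature sets across independent runs.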

Code & Models

Repositories

xiangchensong/sae-feature-consistency (PyTorch, official)
