Position: Mechanistic Interpretability Should Prioritize Feature Consistency in SAEs
Xiangchen Song, Aashiq Muhamed, Yujia Zheng, Lingjing Kong, Zeyu Tang, Mona T. Diab, Virginia Smith, Kun Zhang

TL;DR
This paper advocates for prioritizing feature consistency in Sparse Autoencoders to improve the reliability of mechanistic interpretability, introducing a new metric and validating its effectiveness across synthetic and real-world data.
Contribution
The paper proposes the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric for feature consistency, supports it with theoretical and empirical validation, and argues that consistency is necessary for progress in MI.
Findings
A PW-MCC of 0.80 is achievable for TopK SAEs on LLM activations with appropriate architectural choices.
Feature consistency correlates with semantic similarity of explanations.
Synthetic validation confirms PW-MCC as a reliable proxy.
Abstract
Sparse Autoencoders (SAEs) are a prominent tool in mechanistic interpretability (MI) for decomposing neural network activations into interpretable features. However, the aspiration to identify a canonical set of features is challenged by the observed inconsistency of learned SAE features across different training runs, undermining the reliability and efficiency of MI research. This position paper argues that mechanistic interpretability should prioritize feature consistency in SAEs -- the reliable convergence to equivalent feature sets across independent runs. We propose using the Pairwise Dictionary Mean Correlation Coefficient (PW-MCC) as a practical metric to operationalize consistency and demonstrate that high levels are achievable (0.80 for TopK SAEs on LLM activations) with appropriate architectural choices. Our contributions include detailing the benefits of prioritizing…
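The abstract names PW-MCC but this page does not give its formula. The following is a minimal sketch of one plausible reading, not the authors' implementation: each pair of independently trained dictionaries is matched one-to-one with the Hungarian algorithm on absolute cosine similarity, the matched similarities are averaged, and the result is averaged over all pairs of runs. The function names `mcc` and `pairwise_mcc`, the use of absolute cosine similarity, and the toy random dictionaries are all assumptions made for illustration.

```python
from itertools import combinations

import numpy as np
from scipy.optimize import linear_sum_assignment


def mcc(dict_a: np.ndarray, dict_b: np.ndarray) -> float:
    """Mean matched similarity between two dictionaries of shape (n_features, dim).

    Assumption: similarity is the absolute cosine between feature vectors.
    """
    # Normalize rows so that dot products become cosine similarities.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    sim = np.abs(a @ b.T)
    # Hungarian algorithm: one-to-one matching that maximizes total similarity.
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return float(sim[rows, cols].mean())


def pairwise_mcc(dictionaries) -> float:
    """Average mcc over every unordered pair of dictionaries (hypothetical helper)."""
    if len(dictionaries) < 2:
        raise ValueError("need at least two dictionaries to compare")
    scores = [mcc(a, b) for a, b in combinations(dictionaries, 2)]
    return float(np.mean(scores))


# Toy data standing in for SAE decoder matrices from separate training runs.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
# Same features, shuffled and sign-flipped: should count as fully consistent.
permuted = base[rng.permutation(8)] * rng.choice([-1.0, 1.0], size=(8, 1))
# Independent random features: should score noticeably lower.
unrelated = rng.normal(size=(8, 16))

score_same = mcc(base, permuted)
score_diff = mcc(base, unrelated)
print(f"same features: {score_same:.3f}, unrelated features: {score_diff:.3f}")
```

Under these assumptions the score is invariant to the order and sign of features, so two runs that learn the same features in a different arrangement still score close to 1.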
Taxonomy
Topics: Historical Studies in Central America
Methods: Sparse Evolutionary Training
