Enforcing Orderedness to Improve Feature Consistency
Sophie L. Wang, Alex Quach, Nithin Parsan, John J. Yang

TL;DR
This paper introduces Ordered Sparse Autoencoders (OSAE), which enforce a strict ordering of latent features so that learned representations are more consistent, addressing the variability of traditional SAE features across seeds and hyperparameter settings.
Contribution
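The paper proposes OSAEs, an extension of Matryoshka SAEs that establishes a strict feature order and uses every feature dimension deterministically rather than through sampling. This resolves permutation non-identifiability in sparse dictionary learning settings where solutions are unique up to natural symmetries.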
Findings
OSAEs can help improve feature consistency compared to Matryoshka baselines.
Theoretical result: OSAEs resolve permutation non-identifiability when the sparse dictionary learning solution is unique up to natural symmetries.
Empirical results on the Gemma2-2B and Pythia-70M models.
Abstract
Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.
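The two ingredients named in the abstract (a strict feature order, and deterministic use of every feature dimension) can be illustrated with a toy NumPy sketch. It is an illustration under stated assumptions, not the paper's objective: the function names, the uniform average over all prefixes, and the small dense codes are hypothetical choices made here for clarity. It shows that a plain reconstruction loss cannot tell a dictionary from a permuted copy of it, while a loss summed over every prefix of the latents can.

```python
import numpy as np

def full_reconstruction_loss(codes, decoder, x):
    """Plain SAE-style loss: reconstruct x from all latents at once."""
    return float(np.mean((codes @ decoder - x) ** 2))

def ordered_prefix_loss(codes, decoder, x):
    """Mean reconstruction error over every prefix 1..m of the latents.

    Hypothetical form (an assumption, not taken from the paper): each
    prefix is weighted equally and no prefix sizes are sampled.
    """
    # per_latent[k] is the contribution of latent k, shape (m, n, d)
    per_latent = codes.T[:, :, None] * decoder[:, None, :]
    # prefix_recon[k] reconstructs x using latents 0..k only
    prefix_recon = np.cumsum(per_latent, axis=0)
    return float(np.mean((prefix_recon - x[None]) ** 2))

# Toy data: one sample, three latents, identity dictionary.
codes = np.array([[3.0, 2.0, 1.0]])   # shape (n=1, m=3)
decoder = np.eye(3)                   # shape (m=3, d=3)
x = codes @ decoder

# Permute the latents and the matching dictionary rows together.
perm = [2, 1, 0]
codes_p, decoder_p = codes[:, perm], decoder[perm]

full_a = full_reconstruction_loss(codes, decoder, x)
full_b = full_reconstruction_loss(codes_p, decoder_p, x)
ordered_a = ordered_prefix_loss(codes, decoder, x)
ordered_b = ordered_prefix_loss(codes_p, decoder_p, x)

print(full_a == full_b)       # the plain loss is blind to the permutation
print(ordered_a < ordered_b)  # the prefix loss prefers one ordering
```

In this toy case the prefix loss is lowest when the latents that explain the most of the input come first, which is one way an ordering constraint can remove the permutation ambiguity.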
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning
