Enforcing Orderedness to Improve Feature Consistency

Sophie L. Wang; Alex Quach; Nithin Parsan; John J. Yang

arXiv:2512.02194·cs.LG·December 3, 2025

Open Access

TL;DR

This paper introduces Ordered Sparse Autoencoders (OSAEs), which enforce a strict ordering of latent features so that learned features are more consistent across training runs. The goal is to reduce the variation across seeds and hyperparameter settings seen in standard sparse autoencoders (SAEs).

Contribution

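The paper proposes OSAEs, an extension of Matryoshka SAEs that establishes a strict feature order and uses every feature dimension deterministically instead of relying on sampling-based approximations. In sparse dictionary learning settings where the solution is unique up to natural symmetries, this resolves permutation non-identifiability.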
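Permutation non-identifiability can be illustrated with a small numerical check. This is a sketch for intuition only, not code from the paper: the function name, dimensions, and sparsity level below are hypothetical. It shows that reordering dictionary columns together with the matching code entries leaves the reconstruction unchanged, so an unordered objective cannot prefer one ordering over another.

```python
import numpy as np

def permuted_reconstruction_gap(seed: int = 0) -> float:
    """Largest absolute difference between a reconstruction and its permuted copy.

    All sizes here are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    d_model, n_features = 8, 5

    # Random dictionary (columns are features) and a sparse code vector.
    dictionary = rng.normal(size=(d_model, n_features))
    code = rng.normal(size=n_features) * (rng.random(n_features) < 0.4)

    # Apply the same permutation to the dictionary columns and the code entries.
    perm = rng.permutation(n_features)
    dictionary_perm = dictionary[:, perm]
    code_perm = code[perm]

    # Both products sum the same feature contributions, only in a different order.
    gap = np.abs(dictionary @ code - dictionary_perm @ code_perm)
    return float(gap.max())
```

The returned gap should be zero up to floating-point rounding for any seed, which is why an unordered SAE can recover the same features in a different index order on each run.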

Findings

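01. OSAEs can improve feature consistency relative to Matryoshka SAE baselines.

02. A theoretical result shows that OSAEs resolve permutation non-identifiability in sparse dictionary learning settings where the solution is unique up to natural symmetries.

03. Empirical results are reported on the Gemma2-2B and Pythia-70M models.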

Abstract

Sparse autoencoders (SAEs) have been widely used for interpretability of neural networks, but their learned features often vary across seeds and hyperparameter settings. We introduce Ordered Sparse Autoencoders (OSAE), which extend Matryoshka SAEs by (1) establishing a strict ordering of latent features and (2) deterministically using every feature dimension, avoiding the sampling-based approximations of prior nested SAE methods. Theoretically, we show that OSAEs resolve permutation non-identifiability in settings of sparse dictionary learning where solutions are unique (up to natural symmetries). Empirically on Gemma2-2B and Pythia-70M, we show that OSAEs can help improve consistency compared to Matryoshka baselines.
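The abstract does not state the training objective, so the following is only a minimal sketch of one way a "deterministic, every-dimension" nested loss could look. It assumes the loss is a weighted sum of reconstruction errors over every prefix of the ordered latent code. The function name `prefix_reconstruction_loss`, the uniform default weights, and the tensor shapes are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def prefix_reconstruction_loss(x, codes, decoder, prefix_weights=None):
    """Weighted squared error summed over every prefix of the latent code.

    x:       (batch, d_model) target activations
    codes:   (batch, n_features) latent codes, assumed already sparse
    decoder: (n_features, d_model) one decoder row per feature
    prefix_weights: (n_features,) weight per prefix length; uniform by
        default (an assumption, not the paper's choice)
    """
    n_features = codes.shape[1]
    if prefix_weights is None:
        prefix_weights = np.full(n_features, 1.0 / n_features)

    # contributions[b, k, :] is feature k's contribution to sample b.
    contributions = codes[:, :, None] * decoder[None, :, :]

    # partial[b, k, :] reconstructs sample b from features 0..k only.
    partial = np.cumsum(contributions, axis=1)

    # Squared error of each prefix reconstruction: shape (batch, n_features).
    errors = ((x[:, None, :] - partial) ** 2).sum(axis=-1)

    # Average over the batch, then take the weighted sum over prefix lengths.
    return float((errors * prefix_weights).mean(axis=0).sum())
```

Under this assumed objective, an earlier feature appears in more prefix terms than a later one, so swapping two features generally changes the loss. A sampling-based nested variant would instead evaluate only a few randomly chosen prefix lengths per step.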

Taxonomy

Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning