SMIXAE: Towards Unsupervised Manifold Discovery in Language Models
Collin Francel

TL;DR
This paper introduces SMIXAE (Sparse MIXture of Autoencoders), an autoencoder architecture intended to make manifold structures in language model activations easier to discover and interpret, with supporting evidence from the open-source Gemma 2 2B and 9B models.
Contribution
The paper proposes SMIXAE, an architecture that directly models multidimensional features in language models, enhancing manifold discovery and interpretability.
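The summary does not specify SMIXAE's internals, so the following is only a hypothetical sketch of what a sparse mixture of small autoencoders could look like, not the paper's implementation. Every detail here is an assumption made for illustration: the names `init_params` and `forward`, the linear encoders and decoders, the choice to gate by latent norm, and the top-`k` selection rule. The point it illustrates is that each "expert" owns a small multidimensional latent block, so a feature that lives on a 2D or 3D manifold can be captured by one unit rather than by many separate directions.

```python
import numpy as np

def init_params(rng, d_model, num_experts, d_expert):
    """Random linear encoder/decoder weights for each expert (hypothetical layout)."""
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, (num_experts, d_model, d_expert)),
        "W_dec": rng.normal(0.0, scale, (num_experts, d_expert, d_model)),
    }

def forward(params, x, k):
    """Encode x with every expert, keep the k experts with the largest
    latent norm per input, and decode from those experts only.

    x: array of shape (batch, d_model)
    Returns (reconstruction, sparse latents, boolean mask of active experts).
    """
    # Per-expert latent blocks: (batch, num_experts, d_expert)
    z = np.einsum("bd,edh->beh", x, params["W_enc"])

    # Assumed gating rule: an expert's score is the norm of its latent block.
    norms = np.linalg.norm(z, axis=-1)              # (batch, num_experts)
    top = np.argsort(-norms, axis=1)[:, :k]         # indices of the k largest
    mask = np.zeros(norms.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)

    # Zero out whole blocks for inactive experts (group sparsity).
    z_sparse = z * mask[:, :, None]

    # Sum the decoded contributions of the active experts.
    recon = np.einsum("beh,ehd->bd", z_sparse, params["W_dec"])
    return recon, z_sparse, mask

rng = np.random.default_rng(0)
params = init_params(rng, d_model=16, num_experts=8, d_expert=2)
x = rng.normal(size=(5, 16))
recon, z_sparse, mask = forward(params, x, k=3)
print(recon.shape, z_sparse.shape, mask.sum(axis=1))
```

In this sketch sparsity applies to whole blocks rather than to individual coordinates, which is what would let a single active expert represent a multidimensional feature. Training objectives, biases, and nonlinearities are omitted because the summary gives no information about them.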
Findings
SMIXAE successfully learns known manifold structures.
SMIXAE discovers novel structures in language model activations.
Empirical evidence from the Gemma 2 2B and 9B models supports its effectiveness.
Abstract
Sparse autoencoders (SAEs) have been used widely to decompose and interpret neural network activations, especially those of transformer language models. One key issue with SAEs is their inability to directly model multidimensional features. Instead, SAEs may tile such features with a set of independent directions that must be grouped together after the SAE training phase, impeding the discoverability and interpretation of learned feature representations. We begin to address this issue by introducing the Sparse MIXture of Autoencoders (SMIXAE) architecture. Empirically, we provide evidence that SMIXAE models succeed both in directly learning previously identified manifold structures and in finding novel structures within the open-source Gemma 2 2B and 9B models. Finally, we discuss several limitations and point towards areas for future work.
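The tiling issue described in the abstract can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not an experiment from the paper: the feature is taken to be a unit circle in two dimensions, the "SAE" is reduced to a fixed set of evenly spaced unit directions with a ReLU and a top-1 selection, and the function names `tiling_error` and `subspace_error` are invented for illustration. It shows that independent directions only approximate the circle, with error shrinking as more directions are added, whereas a single two-dimensional block represents it exactly.

```python
import numpy as np

def circle_points(num_points):
    """Evenly spaced points on the unit circle, shape (num_points, 2)."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_points, endpoint=False)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

def tiling_error(num_directions, num_points=3600):
    """Mean squared error when each point is rebuilt from the single
    direction it activates most strongly (a one-dimensional, top-1 code)."""
    points = circle_points(num_points)
    directions = circle_points(num_directions)       # evenly spaced unit vectors

    acts = np.maximum(points @ directions.T, 0.0)     # ReLU activations (N, K)
    best = acts.argmax(axis=1)                        # most active direction
    coeff = acts[np.arange(num_points), best]
    recon = coeff[:, None] * directions[best]
    return float(np.mean(np.sum((points - recon) ** 2, axis=1)))

def subspace_error(num_points=3600):
    """Mean squared error when the same points are coded by one
    two-dimensional block spanning the circle's plane."""
    points = circle_points(num_points)
    basis = np.eye(2)                                 # orthonormal 2D basis
    recon = (points @ basis.T) @ basis
    return float(np.mean(np.sum((points - recon) ** 2, axis=1)))

for k in (4, 16, 64):
    print(k, "directions:", tiling_error(k))
print("one 2D block:", subspace_error())
```

In the tiled case the circle ends up spread across many separately learned directions, which is the post-hoc grouping burden the abstract refers to.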
