Sparsification and Reconstruction from the Perspective of Representation Geometry
Wenjie Sun, Bingzhe Wu, Zhile Yang, Chengke Wu

TL;DR
This paper explores how sparse autoencoders organize language model representations through geometric analysis, revealing their impact on feature disentanglement and reconstruction, and providing insights for interpretability tools.
Contribution
It introduces the SAEMA framework to analyze the geometric structure of sparse representations and links representational separability to reconstruction performance.
Findings
Stratified structure of representations validated by rank variability of SSPD matrices.
Sparse encoding amplifies inter-feature distinctions and increases dimensionality.
Separable global representations causally improve reconstruction quality.
Abstract
Sparse Autoencoders (SAEs) have emerged as a predominant tool in mechanistic interpretability, aiming to identify interpretable monosemantic features. However, how does sparse encoding organize the representations of activation vectors from language models? What is the relationship between this organizational paradigm and feature disentanglement as well as reconstruction performance? To address these questions, we propose SAEMA, which validates the stratified structure of the representation by observing how the rank of the symmetric semipositive definite (SSPD) matrix, corresponding to the modal tensor unfolded along the latent tensor, varies with the level of noise added to the residual stream. To systematically investigate how sparse encoding alters representational structures, we define local and global representations, demonstrating that they amplify inter-feature…
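The rank-versus-noise probe described in the abstract can be illustrated with a minimal sketch. It is not the authors' SAEMA implementation: the toy top-k ReLU encoder, the random weights, the tensor layout (prompts, tokens, latents), and every function name here are assumptions made for illustration. The sketch unfolds a latent tensor along its latent axis, forms the Gram matrix of that unfolding (symmetric and positive semidefinite by construction), and records its numerical rank at several noise levels.

```python
import numpy as np

def topk_relu_encode(x, w_enc, b_enc, k):
    """Toy sparse encoder (assumed, not from the paper): ReLU, then keep the k largest latents."""
    pre = np.maximum(x @ w_enc + b_enc, 0.0)
    # Indices of all entries except the k largest along the latent axis.
    drop = np.argpartition(pre, -k, axis=-1)[..., :-k]
    np.put_along_axis(pre, drop, 0.0, axis=-1)
    return pre

def unfold_latent_mode(z):
    """Unfold a (prompts, tokens, latents) tensor into a (latents, prompts * tokens) matrix."""
    return np.moveaxis(z, -1, 0).reshape(z.shape[-1], -1)

def gram_rank(z):
    """Numerical rank of the Gram matrix of the latent-mode unfolding."""
    u = unfold_latent_mode(z)
    gram = u @ u.T  # symmetric positive semidefinite by construction
    return int(np.linalg.matrix_rank(gram, hermitian=True))

def rank_vs_noise(x, w_enc, b_enc, k, noise_levels, seed=0):
    """Add Gaussian noise of each scale to the activations and record the resulting rank."""
    rng = np.random.default_rng(seed)
    ranks = {}
    for sigma in noise_levels:
        noisy = x + sigma * rng.standard_normal(x.shape)
        ranks[sigma] = gram_rank(topk_relu_encode(noisy, w_enc, b_enc, k))
    return ranks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_prompts, n_tokens, d_model, n_latents, k = 8, 16, 32, 64, 4
    # Random stand-ins for residual-stream activations and encoder weights.
    x = rng.standard_normal((n_prompts, n_tokens, d_model))
    w_enc = rng.standard_normal((d_model, n_latents)) / np.sqrt(d_model)
    b_enc = np.zeros(n_latents)
    for sigma, r in rank_vs_noise(x, w_enc, b_enc, k, [0.0, 0.1, 1.0]).items():
        print(f"noise={sigma:<4} rank={r}")
```

With real residual-stream activations and a trained SAE substituted for the random stand-ins, the same loop would show how the rank responds to noise; the random data here says nothing about the paper's results.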
Taxonomy
Topics: Explainable Artificial Intelligence (XAI) · Generative Adversarial Networks and Image Synthesis · Model Reduction and Neural Networks
