Analyzing (In)Abilities of SAEs via Formal Languages
Abhinav Menon, Manish Shrivastava, David Krueger, Ekdeep Singh Lubana

TL;DR
This paper investigates the interpretability of sparse autoencoders trained on formal language representations, revealing insights into their learned features, sensitivity to training biases, and the importance of causality in feature learning.
Contribution
It introduces a synthetic testbed for analyzing SAEs on formal languages and emphasizes the need to focus on causally relevant features during training.
Findings
Interpretable latent features often emerge in SAEs trained on formal languages.
Performance of SAEs is highly sensitive to training biases and hyperparameters.
Causality should be a central focus in SAE training to improve interpretability.
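To make the hyperparameter sensitivity concrete, here is a minimal sketch of a sparse autoencoder objective. The source does not give the architecture or loss used in the paper, so the ReLU encoder, linear decoder, L1 penalty, and every name below (`sae_loss`, `l1_coeff`, `W_enc`, `W_dec`) are assumptions drawn from one common SAE formulation, not the authors' setup.

```python
def relu(v):
    """Elementwise ReLU on a list of floats."""
    return [max(0.0, x) for x in v]


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sae_loss(x, W_enc, b_enc, W_dec, b_dec, l1_coeff):
    """Return (loss, latents, reconstruction) for one hidden-state vector x.

    Assumed objective: mean squared reconstruction error plus an L1
    sparsity penalty on the latents, weighted by l1_coeff.
    """
    # Encode: z = ReLU(W_enc x + b_enc); the number of rows in W_enc is the latent width.
    z = relu([a + b for a, b in zip(matvec(W_enc, x), b_enc)])
    # Decode: x_hat = W_dec z + b_dec.
    x_hat = [a + b for a, b in zip(matvec(W_dec, z), b_dec)]
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in z)
    return recon + l1_coeff * sparsity, z, x_hat
```

In this formulation, the latent width and `l1_coeff` are two of the hyperparameters that trade reconstruction quality against sparsity, which is one place where the sensitivity described above could show up.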
Abstract
Autoencoders have been used for finding interpretable and disentangled features underlying neural network representations in both image and text domains. While the efficacy and pitfalls of such methods are well-studied in vision, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We aim to address this gap by training sparse autoencoders (SAEs) on a synthetic testbed of formal languages. Specifically, we train SAEs on the hidden representations of models trained on formal languages (Dyck-2, Expr, and English PCFG) under a wide variety of hyperparameter settings, finding that interpretable latents often emerge in the features learned by our SAEs. However, similar to vision, we find that performance is highly sensitive to the inductive biases of the training pipeline. Moreover, we show latents correlating to certain features of the input do…
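For readers unfamiliar with Dyck-2, the language of balanced strings over two bracket types, the following is a minimal sketch of a sampler and membership check. The bracket symbols, the sampling procedure, and the function names are assumptions for illustration and are not taken from the paper. Nesting depth is included only as an example of an input feature that a latent might correlate with.

```python
import random

# Two bracket types for Dyck-2 (the specific symbols are an assumption).
PAIRS = {"(": ")", "[": "]"}


def is_dyck2(s):
    """Return True if s is a balanced string over the two bracket types."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer is expected
        elif not stack or stack.pop() != ch:
            return False  # unmatched closer, wrong closer, or foreign symbol
    return not stack  # every opener must have been closed


def sample_dyck2(n_pairs, rng):
    """Sample a balanced string containing exactly n_pairs bracket pairs."""
    out, stack, opened = [], [], 0
    while opened < n_pairs or stack:
        can_open = opened < n_pairs
        if can_open and (not stack or rng.random() < 0.5):
            opener = rng.choice(sorted(PAIRS))
            out.append(opener)
            stack.append(PAIRS[opener])
            opened += 1
        else:
            out.append(stack.pop())  # close the most recent open bracket
    return "".join(out)


def depths(s):
    """Nesting depth after each token of a balanced string."""
    depth, result = 0, []
    for ch in s:
        depth += 1 if ch in PAIRS else -1
        result.append(depth)
    return result
```

A call such as `sample_dyck2(5, random.Random(0))` yields a ten-token string that `is_dyck2` accepts. Strings like these would serve as training data for the language model whose hidden representations the SAE is then trained on.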
Taxonomy
Topics: Model-Driven Software Engineering Techniques · Business Process Modeling and Analysis · Semantic Web and Ontologies
