Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation
Vitória Barin Pacela, Shruti Joshi, Isabela Camacho, Simon Lacoste-Julien, David Klindt

TL;DR
This paper investigates why sparse autoencoders (SAEs) fail at compositional generalisation under superposition. It shows that the core issue lies in dictionary learning rather than inference amortisation, and highlights the importance of scalable dictionary learning.
Contribution
The study reframes SAE failures as a dictionary learning challenge, demonstrating that improving dictionary learning is crucial for sparse inference under superposition.
Findings
SAEs fail under out-of-distribution compositional shifts due to poor dictionary learning.
Replacing the encoder with per-sample inference does not fix the failure, indicating the core issue is in dictionary learning.
An oracle baseline shows the problem is solvable with a good dictionary at all tested scales.
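The oracle comparison can be illustrated with a small synthetic sketch. It assumes a random unit-norm dictionary `D` standing in for the ground truth, non-negative 3-sparse codes, and non-negative ISTA as the per-sample inference method; the helper names (`make_dictionary`, `sample_codes`, `ista`), the noise level of the "learned" dictionary, and all sizes are hypothetical choices, not the paper's experimental setup. The inference procedure is held fixed and only the dictionary changes.

```python
import numpy as np

def make_dictionary(rng, m, n):
    # Hypothetical ground truth: n unit-norm atoms in m < n dimensions (superposition)
    D = rng.standard_normal((m, n))
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def sample_codes(rng, num, n, k):
    # Non-negative k-sparse codes; each row activates k random latents with values in [1, 2]
    Z = np.zeros((num, n))
    for row in Z:  # each row is a view, so the assignment writes into Z
        row[rng.choice(n, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
    return Z

def relative_error(Z_hat, Z):
    # Mean over samples of ||z_hat - z|| / ||z||
    return float(np.mean(np.linalg.norm(Z_hat - Z, axis=1) / np.linalg.norm(Z, axis=1)))

def ista(X, D, lam=0.01, iters=3000):
    # Per-sample iterative inference: non-negative ISTA on 0.5*||x - D z||^2 + lam*||z||_1
    eta = 1.0 / np.linalg.norm(D, ord=2) ** 2  # step 1/L, L = largest singular value squared
    Z = np.zeros((X.shape[0], D.shape[1]))
    for _ in range(iters):
        grad = (Z @ D.T - X) @ D  # row-wise D^T (D z - x)
        Z = np.maximum(Z - eta * grad - eta * lam, 0.0)  # soft-threshold, clipped at zero
    return Z

rng = np.random.default_rng(0)
m, n, k = 32, 64, 3
D = make_dictionary(rng, m, n)  # stands in for the oracle (ground-truth) dictionary
Z = sample_codes(rng, 200, n, k)
X = Z @ D.T  # each row is x = D z

# Hypothetical imperfectly learned dictionary: each atom perturbed by noise of norm about 0.5
D_learned = D + 0.5 * rng.standard_normal((m, n)) / np.sqrt(m)
D_learned /= np.linalg.norm(D_learned, axis=0, keepdims=True)

err_oracle = relative_error(ista(X, D), Z)
err_learned = relative_error(ista(X, D_learned), Z)
print(f"oracle dictionary: {err_oracle:.3f}, perturbed dictionary: {err_learned:.3f}")
```

Because both runs use the same solver, any difference in recovery error comes from the dictionary alone, which mirrors the argument that per-sample inference cannot compensate for a poorly learned dictionary.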
Abstract
The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary…
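The amortisation gap described above can be sketched by giving both methods the same true dictionary, so that only the inference step differs. The "amortised" side here is a crude stand-in, not a trained SAE: a single linear map plus ReLU with encoder weights tied to `D` and a bias picked from a grid on the same data, compared against iterative non-negative ISTA. Sizes, sparsity, the bias grid, and all helper names are assumptions for illustration.

```python
import numpy as np

def make_dictionary(rng, m, n):
    # Hypothetical ground truth: n unit-norm atoms in m < n dimensions (superposition)
    D = rng.standard_normal((m, n))
    return D / np.linalg.norm(D, axis=0, keepdims=True)

def sample_codes(rng, num, n, k):
    # Non-negative k-sparse codes with active values in [1, 2]
    Z = np.zeros((num, n))
    for row in Z:  # each row is a view, so the assignment writes into Z
        row[rng.choice(n, size=k, replace=False)] = rng.uniform(1.0, 2.0, size=k)
    return Z

def relative_error(Z_hat, Z):
    # Mean over samples of ||z_hat - z|| / ||z||
    return float(np.mean(np.linalg.norm(Z_hat - Z, axis=1) / np.linalg.norm(Z, axis=1)))

def amortised_encoder(X, D, bias):
    # One fixed pass, z = ReLU(D^T x - bias): the functional form of an SAE encoder,
    # but with weights tied to D instead of trained
    return np.maximum(X @ D - bias, 0.0)

def ista(X, D, lam=0.01, iters=3000):
    # Per-sample iterative inference: non-negative ISTA on 0.5*||x - D z||^2 + lam*||z||_1
    eta = 1.0 / np.linalg.norm(D, ord=2) ** 2
    Z = np.zeros((X.shape[0], D.shape[1]))
    for _ in range(iters):
        Z = np.maximum(Z - eta * ((Z @ D.T - X) @ D) - eta * lam, 0.0)
    return Z

rng = np.random.default_rng(0)
m, n, k = 32, 64, 3
D = make_dictionary(rng, m, n)
Z = sample_codes(rng, 200, n, k)
X = Z @ D.T  # each row is x = D z

# Give the one-pass encoder its best bias from a grid, evaluated on the same data
err_amortised = min(relative_error(amortised_encoder(X, D, b), Z)
                    for b in np.linspace(0.0, 1.5, 16))
err_ista = relative_error(ista(X, D), Z)
print(f"one-pass encoder: {err_amortised:.3f}, iterative ISTA: {err_ista:.3f}")
```

In this toy setting a single pass cannot undo the interference between overlapping atoms, while the iterative solver removes it. A trained encoder would do better than this tied-weight stand-in, so the size of the gap here says nothing about the paper's measurements.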
