Stop Probing, Start Coding: Why Linear Probes and Sparse Autoencoders Fail at Compositional Generalisation

Vitória Barin Pacela; Shruti Joshi; Isabela Camacho; Simon Lacoste-Julien; David Klindt

arXiv:2603.28744·cs.LG·March 31, 2026

TL;DR

This paper investigates why sparse autoencoders (SAEs) fail at compositional generalisation under superposition. It locates the core issue in dictionary learning rather than in inference amortisation, and highlights the importance of scalable dictionary learning.

Contribution

The study reframes SAE failures as a dictionary learning challenge, demonstrating that improving dictionary learning is crucial for sparse inference under superposition.

Findings

01. SAEs fail under out-of-distribution compositional shifts due to poor dictionary learning.

02. Replacing the encoder with per-sample inference does not fix the failure, indicating the core issue is in dictionary learning.

03. An oracle baseline shows the problem is solvable with a good dictionary at all tested scales.
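
As a rough illustration of what an oracle-style check could look like, the sketch below runs per-sample sparse inference with the ground-truth dictionary on a toy superposition problem. It is not the paper's experimental setup: ISTA is used here as a standard stand-in for per-sample iterative inference, and the sizes (`n_concepts`, `n_act`, `k`), the penalty `lam`, and the iteration count are all hypothetical choices made for the example.

```python
import numpy as np


def ista(x, D, lam=0.01, n_iter=2000):
    """Per-sample sparse inference via ISTA.

    Minimises 0.5 * ||x - D z||^2 + lam * ||z||_1 for a single sample x.
    """
    step = 1.0 / np.linalg.norm(D, 2) ** 2  # 1 / Lipschitz constant of the gradient
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        u = z - step * (D.T @ (D @ z - x))  # gradient step on the quadratic term
        z = np.sign(u) * np.maximum(np.abs(u) - lam * step, 0.0)  # soft threshold
    return z


rng = np.random.default_rng(0)
n_concepts, n_act, k = 64, 32, 3  # hypothetical sizes: more concepts than activations

# Ground-truth ("oracle") dictionary with unit-norm columns.
D = rng.normal(size=(n_act, n_concepts))
D /= np.linalg.norm(D, axis=0, keepdims=True)

# A k-sparse concept vector, projected into the smaller activation space.
z_true = np.zeros(n_concepts)
support = rng.choice(n_concepts, size=k, replace=False)
z_true[support] = rng.uniform(1.0, 2.0, size=k)
x = D @ z_true

# Recover the concepts from the activation using the oracle dictionary.
z_hat = ista(x, D)
recovered = set(np.argsort(-np.abs(z_hat))[:k].tolist())
rel_err = np.linalg.norm(z_hat - z_true) / np.linalg.norm(z_true)
print("support recovered:", recovered == set(support.tolist()))
print("relative error:", rel_err)
```

With the dictionary known, the only remaining task is sparse inference, which is the regime where compressed sensing guarantees apply. Replacing `D` with a learned dictionary is where, according to the findings above, the difficulty lies.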

Abstract

The linear representation hypothesis states that neural network activations encode high-level concepts as linear mixtures. However, under superposition, this encoding is a projection from a higher-dimensional concept space into a lower-dimensional activation space, and a linear decision boundary in the concept space need not remain linear after projection. In this setting, classical sparse coding methods with per-sample iterative inference leverage compressed sensing guarantees to recover latent factors. Sparse autoencoders (SAEs), on the other hand, amortise sparse inference into a fixed encoder, introducing a systematic gap. We show this amortisation gap persists across training set sizes, latent dimensions, and sparsity levels, causing SAEs to fail under out-of-distribution (OOD) compositional shifts. Through controlled experiments that decompose the failure, we identify dictionary…
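
The contrast the abstract draws can be written down in standard textbook notation. The symbols below ($A$, $D$, $W$, $b$, $\lambda$, $F$) are assumptions chosen for illustration and are not taken from the paper; the ReLU encoder is one common SAE form, not necessarily the one studied.

```latex
% Superposition: n sparse concepts projected into m < n activation dimensions
x = A z, \qquad A \in \mathbb{R}^{m \times n}, \quad m < n, \quad \|z\|_0 \le k

% Per-sample sparse coding: solve an optimisation problem for each x
F(z; x) = \tfrac{1}{2} \|x - D z\|_2^2 + \lambda \|z\|_1, \qquad
\hat{z}_{\mathrm{SC}}(x) = \arg\min_{z} F(z; x)

% Amortised inference: a single fixed encoder shared across all x
\hat{z}_{\mathrm{SAE}}(x) = \mathrm{ReLU}(W x + b)

% One way to quantify an amortisation gap for a given dictionary D
F\big(\hat{z}_{\mathrm{SAE}}(x); x\big) - F\big(\hat{z}_{\mathrm{SC}}(x); x\big) \ge 0
```

The last inequality holds by definition of the minimiser: for a fixed dictionary, no fixed encoder can achieve a lower objective value than per-sample optimisation.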
