Supervised sparse auto-encoders for interpretable and compositional representations
Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

TL;DR
This paper introduces supervised sparse auto-encoders that produce interpretable, compositional representations, enabling semantic image editing and generalization to unseen concept combinations.
Contribution
It adapts unconstrained feature models, a mathematical framework from neural collapse theory, to supervised (decoder-only) sparse auto-encoders that jointly learn sparse concept embeddings and decoder weights to reconstruct feature vectors.
Findings
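Demonstrates compositional generalization on Stable Diffusion 3.5 by reconstructing images with concept combinations unseen during training
Enables feature-level intervention for semantic image editing without prompt modification
Addresses two limitations of sparse auto-encoders: the non-smooth penalty, which hinders reconstruction and scalability, and the lack of alignment between learned features and human semantics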
Abstract
Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models, a mathematical framework from neural collapse theory, and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.
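To make the "decoder-only" idea in the abstract concrete, here is a minimal NumPy sketch, not the authors' implementation. It assumes each sample comes with a multi-hot concept label vector, that every concept owns one learnable embedding block, and that a linear decoder maps the resulting block-sparse code to the target feature vector. The names `build_codes`, `decode`, and `fit`, the block layout, the linear decoder, and the plain gradient-descent loop are all illustrative assumptions; the abstract does not specify these details.

```python
import numpy as np

def build_codes(labels, embeddings):
    """Assemble block-sparse codes (hypothetical layout, not from the paper).

    labels:     (n, k) multi-hot array, 1.0 where concept j is present.
    embeddings: (k, d) one learnable embedding per concept.
    Returns an (n, k*d) array whose block j is embeddings[j] when concept j
    is active and all zeros otherwise, so sparsity comes from the labels
    rather than from a non-smooth penalty.
    """
    n, k = labels.shape
    d = embeddings.shape[1]
    return (labels[:, :, None] * embeddings[None, :, :]).reshape(n, k * d)

def decode(labels, embeddings, weights):
    """Reconstruct feature vectors from concept labels with a linear decoder."""
    return build_codes(labels, embeddings) @ weights

def fit(labels, targets, d=4, lr=0.05, steps=2000, seed=0):
    """Jointly learn concept embeddings E and decoder weights W by gradient
    descent on the mean squared reconstruction error (assumed objective)."""
    rng = np.random.default_rng(seed)
    n, k = labels.shape
    m = targets.shape[1]
    E = rng.normal(scale=0.1, size=(k, d))
    W = rng.normal(scale=0.1, size=(k * d, m))
    losses = []
    for _ in range(steps):
        Z = build_codes(labels, E)            # (n, k*d) sparse codes
        R = Z @ W - targets                   # (n, m) residuals
        losses.append(float(np.mean(R ** 2)))
        scale = 2.0 / R.size                  # derivative of the mean of squares
        grad_W = scale * (Z.T @ R)
        grad_Z = scale * (R @ W.T)            # (n, k*d)
        # Only active concept blocks receive gradient.
        grad_E = (grad_Z.reshape(n, k, d) * labels[:, :, None]).sum(axis=0)
        W -= lr * grad_W
        E -= lr * grad_E
    return E, W, losses
```

Under these assumptions, a feature-level intervention amounts to flipping entries of a label vector and calling `decode` again, and a concept combination absent from training is simply a label vector that never appeared in `labels`. Whether such a reconstruction is faithful is an empirical question that this sketch does not answer.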
