Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Thomas Fel; Ekdeep Singh Lubana; Jacob S. Prince; Matthew Kowal; Victor Boutin; Isabel Papadimitriou; Binxu Wang; Martin Wattenberg; Demba Ba; Talia Konkle

arXiv:2502.12892·cs.CV·May 27, 2025

Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Thomas Fel, Ekdeep Singh Lubana, Jacob S. Prince, Matthew Kowal, Victor Boutin, Isabel Papadimitriou, Binxu Wang, Martin Wattenberg, Demba Ba, Talia Konkle

PDF

Open Access 1 Models

TL;DR

This paper introduces Archetypal SAEs, a new stable dictionary learning method for interpretability in large vision models, addressing instability issues of traditional SAEs and improving concept extraction quality.

Contribution

We propose Archetypal SAEs (A-SAE), a novel approach that constrains dictionaries to the data convex hull, significantly enhancing stability and interpretability in large-scale vision models.

Findings

01

RA-SAEs improve dictionary stability and interpretability

02

New benchmarks assess plausibility and identifiability of concepts

03

RA-SAEs uncover semantically meaningful concepts in vision models

Abstract

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: existing SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the convex hull of data. This geometric anchoring significantly enhances the stability of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

🤗
matybohacek/RA-SAE-DINOv2-32k
model· 40 dl
40 dl

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdvanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques · Multimodal Machine Learning Applications