Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders

Matyas Bohacek, Thomas Fel, Maneesh Agrawala, and Ekdeep Singh Lubana

arXiv:2506.19708·cs.GR·June 25, 2025

Open Access·1 Model·3 Reviews

TL;DR

This paper introduces a systematic method using sparse autoencoders to identify and analyze conceptual blindspots in generative image models, revealing specific missing or exaggerated concepts and memorization artifacts.

Contribution

It presents the largest archetypal SAE trained on DINOv2 features for detailed analysis of conceptual disparities in popular generative models.

Findings

01

Identified suppressed blindspots like bird feeders and DVD discs.

02

Detected exaggerated blindspots such as wood textures and palm trees.

03

Isolated memorization artifacts reproducing training data templates.

Abstract

Despite their impressive performance, generative image models trained on large-scale datasets frequently fail to produce images with seemingly simple concepts -- e.g., human hands or objects appearing in groups of four -- that are reasonably expected to appear in the training data. These failure modes have largely been documented anecdotally, leaving open the question of whether they reflect idiosyncratic anomalies or more structural limitations of these models. To address this, we introduce a systematic approach for identifying and characterizing "conceptual blindspots" -- concepts present in the training data but absent or misrepresented in a model's generations. Our method leverages sparse autoencoders (SAEs) to extract interpretable concept embeddings, enabling a quantitative comparison of concept prevalence between real and generated images. We train an archetypal SAE (RA-SAE) on…
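The comparison of concept prevalence between real and generated images can be illustrated with a small sketch. It assumes SAE concept codes are already available as non-negative arrays of shape (images, concepts). The function names `concept_energy` and `energy_gap`, the toy arrays, and the normalised-difference score are assumptions made for illustration; the paper's exact definition is not given in this excerpt and may differ.

```python
import numpy as np


def concept_energy(codes: np.ndarray) -> np.ndarray:
    """Mean activation of each concept over a set of images.

    codes: array of shape (n_images, n_concepts) with non-negative SAE codes.
    """
    return codes.mean(axis=0)


def energy_gap(real_codes: np.ndarray, gen_codes: np.ndarray,
               eps: float = 1e-8) -> np.ndarray:
    """Hypothetical per-concept score in roughly [-1, 1].

    Positive values: concept is rarer in generated images (suppressed).
    Negative values: concept is more common in generated images (exaggerated).
    This normalisation is an assumption, not the paper's stated formula.
    """
    e_real = concept_energy(real_codes)
    e_gen = concept_energy(gen_codes)
    return (e_real - e_gen) / (e_real + e_gen + eps)


# Toy codes for three concepts over two real and two generated images.
real = np.array([[1.0, 0.0, 0.5],
                 [1.0, 0.0, 0.5]])
gen = np.array([[0.0, 0.0, 0.5],
                [0.0, 2.0, 0.5]])

gap = energy_gap(real, gen)

# Rank concepts from most suppressed to most exaggerated.
ranking = np.argsort(-gap)
print(gap, ranking)
```

In this toy setup the first concept appears only in the real images, the second only in the generated ones, and the third equally in both, so the score separates a suppressed concept, an exaggerated concept, and a well-matched one.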

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01·Rating 6·Confidence 2

Strengths

- The problem is novel and well-motivated, and the paper is well-written.
- While I feel like some of the observations are not new (for instance Sec 4.6), I think it is nice that the framework can be used to quantify them.
- The open-source framework is highly appreciated.

Weaknesses

- Section 4.4 would benefit from a more systematic study and quantitative results. For instance, when the authors claim that “While some of these discrepancies can be attributed to underspecified or noisy captions, others reveal genuine blindspots”, I think it would be better to quantify what portion of the images with a large energy difference have noisy captions.

Reviewer 02·Rating 4·Confidence 3

Strengths

- The paper offers a new perspective on understanding the failure modes of generative models and provides a mathematical definition of a conceptual blindspot.
- The method can scale easily in an unsupervised manner, free from human inputs or manual checks.
- The paper is very well written.

Weaknesses

- Including an error analysis and a discussion of the approach's failure modes, such as cases where the SAE produces spurious concepts or misses obvious ones, would strengthen the paper.
- A baseline comparison is missing. How does the proposed method compare to other frameworks proposed in the literature? Do the observations agree?
- What about mitigation? No mitigation strategy is proposed for reducing the blindspots. How to combine this with classifier-guided generation, for example, use the proposed energy

Reviewer 03·Rating 8·Confidence 4

Strengths

* The paper is very well written, and together with the well-designed figures, I truly enjoyed reading the manuscript.
* The work is contextualized well within the existing literature.
* The proposed technique for automatically mining the blindspot failure modes of generative image models, as well as the formalism of blindspots itself, is both sound and timely. I believe the ICLR community would find these results, and the interactive tool, highly interesting.
* In terms of reproducibility, aut

Weaknesses

* While not a major concern, there appears to be an implicit conceptual assumption that the concepts discovered by the SAE are non-redundant for the energy discrepancy to be meaningful. It is unclear whether this assumption holds in practice. If redundancy exists empirically, its impact on the interpretation of results should be discussed. Otherwise, it would be helpful to demonstrate that the discovered concepts are indeed unique/distinct. More on this is discussed in the questions b

Code & Models

Models

🤗
matybohacek/RA-SAE-DINOv2-32k
model· 40 dl

Taxonomy

Topics·Generative Adversarial Networks and Image Synthesis

Methods·Diffusion