CEIR: Concept-based Explainable Image Representation Learning
Yan Cui, Shuhong Liu, Liuzhuozheng Li, Zhiyuan Yuan

TL;DR
CEIR introduces a concept-based, explainable image representation learning method that leverages pretrained models and GPT-4 generated concepts to improve interpretability and clustering performance in an unsupervised manner.
Contribution
The paper proposes a novel concept-based framework combining CBM, CLIP, GPT-4, and VAE to enhance interpretability and performance of unsupervised image representations.
Findings
Achieves state-of-the-art clustering on CIFAR datasets
Provides human-understandable concept attributions
Enables open-world concept extraction without fine-tuning
Abstract
In modern machine learning, the trend of harnessing self-supervised learning to derive high-quality representations without label dependency has garnered significant attention. However, the absence of label information, coupled with the inherently high-dimensional nature of these representations, increases the difficulty of interpreting them. Consequently, indirect evaluations become the popular metric for evaluating the quality of these features, leading to a biased validation of the learned representation rationale. To address these challenges, we introduce a novel approach termed Concept-based Explainable Image Representation (CEIR). Initially, using the Concept-based Model (CBM) incorporated with pretrained CLIP and concepts generated by GPT-4, we project input images into a concept vector space. Subsequently, a Variational Autoencoder (VAE) learns the latent representation from…
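The concept-projection step described in the abstract can be illustrated with a minimal sketch. It assumes that image and concept-text embeddings already live in a shared space (as CLIP provides) and scores each concept by cosine similarity. The toy 3-dimensional embeddings, the concept names, and the helper functions `cosine`, `concept_vector`, and `top_concepts` are all hypothetical, and the sketch omits the learned projection layer and the VAE, so it should not be read as the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def concept_vector(image_emb, concept_embs):
    """Map an image embedding to one similarity score per named concept."""
    return {name: cosine(image_emb, emb) for name, emb in concept_embs.items()}


def top_concepts(scores, k=2):
    """Return the k concept names with the highest scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical toy embeddings standing in for CLIP text/image features.
concept_embs = {
    "has wings": [1.0, 0.0, 0.0],
    "has wheels": [0.0, 1.0, 0.0],
    "has fur": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.1, 0.3]

scores = concept_vector(image_emb, concept_embs)
print(top_concepts(scores))
```

Because each coordinate of the resulting vector is tied to a named concept, the highest-scoring entries can be read directly as a human-understandable description of the image, which is the property the downstream latent representation is meant to inherit.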
Peer Reviews
Decision · Submitted to ICLR 2024
+ The paper addresses an important issue in representation learning: the ability to learn human-understandable representations without the need for a large annotated dataset.
+ The proposed idea is simple, leveraging different existing models: CLIP-based models, text generative models, and a VAE. It is a nice way to combine existing ideas in the field of representation learning and XAI.
+ The proposed approach enables state-of-the-art results on clustering tasks on different visual classification bench
I have several concerns:
+ My first concern is related to the positioning of the paper relative to concept-based explainable image representation. In particular, in the XAI field, some criteria and properties have been proposed to define what makes a good explainable representation (see for instance the work of Ghorbani [here](https://arxiv.org/pdf/1902.03129.pdf)), such as meaningfulness, coherency, and importance. How the proposed approach tackles these aspects is not clear and not evaluated
The method produces impressive clustering results on ImageNet, CIFAR, and STL-10 datasets (comparable to or above state of the art). Using CLIP to find nameable concepts for XAI is a good idea, and the paper demonstrates how this makes it easier to access and interact with the concepts (e.g., find more images from another class that contain the same concept as a given class).
The writing is frequently unclear, which makes many parts of the paper hard to understand. The evaluation is limited to everyday object/scene datasets (ImageNet, CIFAR, STL-10) which are the datasets where this approach should work best due to the high overlap with CLIP’s training set. It would be nice to see evaluation on a broader range of datasets. There's no user experiment, so it's unclear if these explanations would be useful for humans or how they compare to other XAI approaches. The e
(1) The idea may be interesting but I don’t understand the method and I’m not very sure about this.
(1) I cannot understand what the method is doing. For example, I don’t see what the superscript “3” in Eq. (1) means. Does it mean cubic? Why is it necessary? $\mathcal{Q}$, $\mathcal{H}$, $\mathcal{A}(\mathbb{R}^\mathcal{Q})$, and $\mathbb{R}$ are not defined (at least the paper is not self-contained). All these missing details make it almost impossible to see what’s going on in the method. (2) My major question is the reason why the backbone and the projection layer are necessary. If $P_{i,:}$
Taxonomy
Topics: Domain Adaptation and Few-Shot Learning · Radiomics and Machine Learning in Medical Imaging · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam · Layer Normalization · Residual Connection
