CEIR: Concept-based Explainable Image Representation Learning

Yan Cui; Shuhong Liu; Liuzhuozheng Li; Zhiyuan Yuan

arXiv:2312.10747 · cs.CV · December 19, 2023 · 1 citation

Open Access 3 Reviews

TL;DR

CEIR introduces a concept-based, explainable image representation learning method that leverages pretrained models and GPT-4 generated concepts to improve interpretability and clustering performance in an unsupervised manner.

Contribution

The paper proposes a novel concept-based framework combining CBM, CLIP, GPT-4, and VAE to enhance interpretability and performance of unsupervised image representations.

Findings

01 Achieves state-of-the-art clustering on CIFAR datasets

02 Provides human-understandable concept attributions

03 Enables open-world concept extraction without fine-tuning

Abstract

In modern machine learning, the trend of harnessing self-supervised learning to derive high-quality representations without label dependency has garnered significant attention. However, the absence of label information, coupled with the inherently high-dimensional nature of the representations, increases the difficulty of interpreting them. Consequently, indirect evaluations have become the popular way to assess the quality of these features, leading to a biased validation of the learned representation rationale. To address these challenges, we introduce a novel approach termed Concept-based Explainable Image Representation (CEIR). Initially, using the Concept-based Model (CBM) incorporated with pretrained CLIP and concepts generated by GPT-4, we project input images into a concept vector space. Subsequently, a Variational Autoencoder (VAE) learns the latent representation from…
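The two stages named in the abstract (projection into a concept space, then a VAE over the concept vectors) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: random arrays stand in for CLIP image embeddings and for CLIP text embeddings of GPT-4-generated concepts, plain cosine similarity stands in for the trained concept projection, and the encoder weights `W_mu` and `W_log_var` are hypothetical, untrained linear maps.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    norms = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norms, eps)


def concept_scores(image_emb, concept_emb):
    """Cosine similarity between every image and every concept embedding."""
    return l2_normalize(image_emb) @ l2_normalize(concept_emb).T


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the usual VAE trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dimensions."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var), axis=-1)


rng = np.random.default_rng(0)

# Assumption: stand-ins for pretrained embeddings (4 images, 10 concepts, 512-d).
image_emb = rng.standard_normal((4, 512))
concept_emb = rng.standard_normal((10, 512))

# Stage 1: each image becomes a vector of concept scores.
scores = concept_scores(image_emb, concept_emb)  # shape (4, 10)

# Stage 2: a hypothetical, untrained linear VAE encoder over concept vectors.
latent_dim = 3
W_mu = 0.1 * rng.standard_normal((10, latent_dim))
W_log_var = 0.1 * rng.standard_normal((10, latent_dim))
mu = scores @ W_mu
log_var = scores @ W_log_var

z = reparameterize(mu, log_var, rng)      # latent representation per image
kl = kl_to_standard_normal(mu, log_var)   # regularization term per image
```

In a real pipeline the latent `z` would come from a trained encoder and would then be passed to a clustering algorithm; here it only shows the shapes and the data flow.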

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

+ The paper addresses an important issue in representation learning: the ability to learn human-understandable representations without the need for a large annotated dataset.
+ The proposed idea is simple, leveraging different existing models: CLIP-based models, text generative models, VAE. It is a nice way to combine existing ideas in the field of representation learning and XAI.
+ The proposed approach enables state-of-the-art results on clustering tasks on different visual classification bench

Weaknesses

I have several concerns:
+ My first concern is related to the positioning of the paper compared to concept-based explainable image representation. In particular, in the XAI field, some criteria and properties have been proposed to define the concept of good explainable representation (see for instance the work of Ghorbani [here](https://arxiv.org/pdf/1902.03129.pdf)) such as meaningfulness, coherency, and importance. How the proposed approach tackles these aspects is not clear and not evaluated

Reviewer 02 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

The method produces impressive clustering results on ImageNet, CIFAR, and STL-10 datasets (comparable to or above state of the art). Using CLIP to find nameable concepts for XAI is a good idea, and the paper demonstrates how this makes it easier to access and interact with the concepts (e.g., find more images from another class that contain the same concept as a given class).

Weaknesses

The writing is frequently unclear, which makes many parts of the paper hard to understand. The evaluation is limited to everyday object/scene datasets (ImageNet, CIFAR, STL-10) which are the datasets where this approach should work best due to the high overlap with CLIP’s training set. It would be nice to see evaluation on a broader range of datasets. There's no user experiment, so it's unclear if these explanations would be useful for humans or how they compare to other XAI approaches. The e

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

(1) The idea may be interesting but I don’t understand the method and I’m not very sure about this.

Weaknesses

(1) I cannot understand what the method is doing. For example, I don’t see what the superscript “3” in Eq. (1) means. Does it mean cubic? Why is it necessary? $\mathcal{Q}$, $\mathcal{H}$, $\mathcal{A}(\mathbb{R}^\mathcal{Q})$, and $\mathbb{R}$ are not defined (at least the paper is not self-contained). All these missing details make it almost impossible to see what’s going on in the method.
(2) My major question is the reason why the backbone and the projection layer are necessary. If $P_{i,:}$

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Radiomics and Machine Learning in Medical Imaging · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Adam · Layer Normalization · Residual Connection