Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP

Sriram Balasubramanian; Samyadeep Basu; Soheil Feizi

arXiv:2406.01583 · cs.CV · October 22, 2024

TL;DR

This paper presents a framework that decomposes the final representation of vision transformers (ViTs) beyond CLIP into contributions from internal components, such as attention heads and MLPs, and interprets those contributions via text, supporting applications such as image retrieval.

Contribution

The authors introduce a general method to decompose ViT representations and interpret component roles via text, applicable to multiple ViT variants beyond CLIP.

Findings

01 Identified roles of ViT components in image feature representation

02 Enabled text-based interpretation and visualization of ViT components

03 Improved image retrieval and understanding through component analysis

Abstract

Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to…
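
Steps (a) and (b) of the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes the final representation is a plain sum of per-component contributions (ignoring any final normalization), it uses a hand-written matrix `A` in place of a fitted linear map to CLIP space, and it ranks components by a simple dot product with a text embedding rather than by the paper's scoring function. All names and numbers (`contributions`, `A`, `text_embedding`) are hypothetical.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def add(u: Vector, v: Vector) -> Vector:
    """Element-wise sum of two vectors of equal length."""
    return [a + b for a, b in zip(u, v)]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (rows x cols) by vector v (cols)."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def dot(u: Vector, v: Vector) -> float:
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# (a) Hypothetical contributions of three components to a 3-d representation.
contributions: Dict[str, Vector] = {
    "head_0": [1.0, 0.0, 0.0],
    "head_1": [0.0, 2.0, 0.0],
    "mlp_0": [0.0, 0.0, 0.5],
}

# Assumption: the final representation is the sum of the contributions.
final_rep: Vector = [0.0, 0.0, 0.0]
for c in contributions.values():
    final_rep = add(final_rep, c)

# (b) Hypothetical linear map from the 3-d model space to a 2-d "CLIP" space.
A: Matrix = [
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]
mapped = {name: matvec(A, c) for name, c in contributions.items()}

# Because the map is linear, mapping the sum equals summing the mapped parts.
mapped_final = matvec(A, final_rep)
sum_of_mapped: Vector = [0.0, 0.0]
for m in mapped.values():
    sum_of_mapped = add(sum_of_mapped, m)

# Stand-in scoring: dot product of each mapped contribution with a
# hypothetical text embedding (not the paper's scoring function).
text_embedding: Vector = [0.0, 1.0]
scores = {name: dot(m, text_embedding) for name, m in mapped.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking[0])  # head_1 has the highest score in this toy setup
```

The point of using a linear map is that the decomposition survives it: each component's mapped contribution can be compared with text embeddings separately, and the mapped contributions still add up to the mapped final representation.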

Code & Models

Repositories

sriramb-98/vit-decompose (PyTorch, official)

Videos

Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP · SlidesLive

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques

Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Vision Transformer · Multi-Head Attention · Dropout · Dense Connections · Contrastive Language-Image Pre-training