Decomposing and Interpreting Image Representations via Text in ViTs Beyond CLIP
Sriram Balasubramanian, Samyadeep Basu, Soheil Feizi

TL;DR
This paper presents a framework that decomposes the final representation of a vision transformer (ViT) into contributions from its internal components and interprets those contributions via text, extending this kind of analysis from CLIP to other ViTs.
Contribution
The authors introduce a general method to decompose ViT representations and interpret component roles via text, applicable to multiple ViT variants beyond CLIP.
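To make "decomposing a representation" concrete, here is a minimal sketch on a toy residual network, not the authors' implementation. It assumes that every component adds its output to a residual stream and that the final projection is linear, so the final representation is exactly the sum of per-component contributions. The names `weights`, `proj`, and `forward_with_contributions` are hypothetical, and real ViTs also contain a final LayerNorm that this sketch ignores.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_out, n_blocks = 8, 4, 3

# Hypothetical toy components: each block adds a nonlinear update to the residual stream.
weights = [rng.normal(size=(d_model, d_model)) / np.sqrt(d_model) for _ in range(n_blocks)]
proj = rng.normal(size=(d_model, d_out))  # final linear projection (assumed, no LayerNorm)


def forward_with_contributions(x0):
    """Return the final representation and each component's contribution to it."""
    x = x0
    contributions = [x0 @ proj]  # contribution of the input embedding itself
    for w in weights:
        update = np.tanh(x @ w)  # stand-in for an attention head or MLP output
        contributions.append(update @ proj)  # project this component's output on its own
        x = x + update  # residual connection
    return x @ proj, np.stack(contributions)


x0 = rng.normal(size=d_model)
final, contributions = forward_with_contributions(x0)

# Because the projection is linear, the contributions sum to the final representation.
print(np.allclose(final, contributions.sum(axis=0)))
```

Each row of `contributions` can then be analyzed separately, which is the property that component-level interpretation relies on.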
Findings
Identified roles of ViT components in image feature representation
Enabled text-based interpretation and visualization of ViT components
Improved image retrieval and understanding through component analysis
Abstract
Recent work has explored how individual components of the CLIP-ViT model contribute to the final representation by leveraging the shared image-text representation space of CLIP. These components, such as attention heads and MLPs, have been shown to capture distinct image features like shape, color or texture. However, understanding the role of these components in arbitrary vision transformers (ViTs) is challenging. To this end, we introduce a general framework which can identify the roles of various components in ViTs beyond CLIP. Specifically, we (a) automate the decomposition of the final representation into contributions from different model components, and (b) linearly map these contributions to CLIP space to interpret them via text. Additionally, we introduce a novel scoring function to rank components by their importance with respect to specific features. Applying our framework to…
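Step (b) of the abstract, mapping contributions linearly into CLIP space and reading them via text, can be illustrated with the following sketch on synthetic data. It is an assumption-laden toy: the abstract does not say how the linear map is fitted, so ordinary least squares is used here as one simple choice, and the functions `fit_linear_map` and `rank_texts`, the random "CLIP" text embeddings, and the three labels are all hypothetical stand-ins.

```python
import numpy as np


def fit_linear_map(model_feats, clip_feats):
    """Least-squares W such that model_feats @ W approximates clip_feats."""
    w, *_ = np.linalg.lstsq(model_feats, clip_feats, rcond=None)
    return w


def rank_texts(contribution, w, text_embs, texts):
    """Map one component contribution into CLIP space and sort texts by cosine similarity."""
    mapped = contribution @ w
    mapped = mapped / np.linalg.norm(mapped)
    text_unit = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    scores = text_unit @ mapped
    order = np.argsort(-scores)  # highest similarity first
    return [(texts[i], float(scores[i])) for i in order]


# Synthetic stand-ins: a 6-dim "ViT" space and a 5-dim "CLIP" space related by a known map.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(6, 5))
model_feats = rng.normal(size=(200, 6))
clip_feats = model_feats @ true_map
w = fit_linear_map(model_feats, clip_feats)

texts = ["red", "striped", "round"]
text_embs = rng.normal(size=(3, 5))  # placeholder for real CLIP text embeddings

# Build a contribution whose image under the true map is the "striped" embedding.
contribution = text_embs[1] @ np.linalg.pinv(true_map)
ranking = rank_texts(contribution, w, text_embs, texts)
print(ranking[0][0])
```

In practice the paired features would come from running the target ViT and a CLIP image encoder on the same images, and the text embeddings from a CLIP text encoder; those steps are omitted here.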
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques
Methods: Attention Is All You Need · Residual Connection · Softmax · Layer Normalization · Linear Layer · Vision Transformer · Multi-Head Attention · Dropout · Dense Connections · Contrastive Language-Image Pre-training
