Quantifying and Enabling the Interpretability of CLIP-like Models
Avinash Madasu, Yossi Gandelsman, Vasudev Lal, Phillip Howard

TL;DR
This paper investigates the interpretability of CLIP-like models by analyzing attention heads, introducing new metrics and a tool to facilitate understanding of model inner workings across different sizes and training data.
Contribution
It presents a comprehensive interpretability study of CLIP models, introduces new metrics for evaluating interpretability, and develops CLIP-InterpreT, a tool for detailed analysis.
Findings
Larger CLIP models are more interpretable than smaller ones.
The study introduces new metrics for measuring property consistency and disentanglement.
CLIP-InterpreT provides five analysis methods for understanding model behavior.
Abstract
CLIP is one of the most popular foundation models and is heavily used for many vision-language tasks. However, little is known about the inner workings of CLIP. To bridge this gap, we propose a study to quantify interpretability in CLIP-like models. We conduct this study on six different CLIP models from OpenAI and OpenCLIP, which vary by size, type of pre-training data, and patch size. Our approach begins by using the TEXTSPAN algorithm and in-context learning to break down individual attention heads into specific properties. We then evaluate how easily these heads can be interpreted using new metrics that measure property consistency within heads and property disentanglement across heads. Our findings reveal that larger CLIP models are generally more interpretable than their smaller counterparts. To further assist users in understanding the inner workings of CLIP models, we…
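The abstract names two kinds of metric but this listing does not give their definitions. As a rough illustration only, the sketch below uses toy stand-ins: "consistency" is taken to be the share of a head's text descriptions that carry its most common property label, and "disentanglement" the share of heads whose dominant property is not shared with any other head. These formulas, the function names, the head identifiers, and the labels are all assumptions made for illustration and are not the paper's metrics.

```python
from collections import Counter

# Toy stand-ins, NOT the paper's definitions: each head maps to the property
# labels assigned to its text descriptions (labels and head names are made up).


def consistency(labels):
    """Share of a head's labels that match its most common label."""
    if not labels:
        return 0.0
    top_count = Counter(labels).most_common(1)[0][1]
    return top_count / len(labels)


def dominant_property(labels):
    """Most frequent label for a head (ties broken by first occurrence)."""
    return Counter(labels).most_common(1)[0][0]


def disentanglement(heads):
    """Share of heads whose dominant property is unique among all heads."""
    dominants = [dominant_property(ls) for ls in heads.values() if ls]
    if not dominants:
        return 0.0
    counts = Counter(dominants)
    unique = sum(1 for d in dominants if counts[d] == 1)
    return unique / len(dominants)


# Hypothetical example data: three heads with five labeled descriptions each.
toy_heads = {
    "L22.H1": ["color", "color", "color", "texture", "color"],
    "L22.H2": ["location", "location", "location", "location", "location"],
    "L23.H0": ["color", "animal", "color", "color", "shape"],
}

if __name__ == "__main__":
    for name, labels in toy_heads.items():
        print(name, dominant_property(labels), consistency(labels))
    print("disentanglement:", disentanglement(toy_heads))
```

In this toy data two heads share "color" as their dominant property, so only one of the three heads counts as disentangled. A real evaluation would use the property labels produced by the paper's own pipeline rather than hand-written lists.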
Taxonomy
Topics: Semantic Web and Ontologies
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training
