A Closer Look at the Explainability of Contrastive Language-Image Pre-training
Yi Li, Hualiang Wang, Yiqun Duan, Jiheng Zhang, Xiaomeng Li

TL;DR
This paper critically examines CLIP's explainability issues, identifies causes related to architecture and features, and proposes CLIP Surgery, a method that enhances interpretability and extends CLIP's capabilities without additional training.
Contribution
The paper introduces CLIP Surgery, a novel architecture modification technique that improves CLIP's explainability and open-vocabulary performance without fine-tuning.
Findings
CLIP tends to focus on background regions in visualizations.
Noisy activations are caused by redundant features among categories.
CLIP Surgery significantly improves explainability and multimodal visualization.
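The second finding, noisy activations caused by features that are redundant across categories, can be illustrated with a toy example. The sketch below is an assumption made for illustration and not the paper's exact procedure: the helper names `similarity_map` and `remove_shared_component`, and the tiny hand-written feature vectors, are all hypothetical. It scores each image patch against each class text embedding, then subtracts each patch's mean score across classes, so that whatever every class shares no longer inflates every map.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def similarity_map(patches, texts):
    """Raw patch-to-text scores: one row per patch, one column per class."""
    return [[dot(p, t) for t in texts] for p in patches]


def remove_shared_component(sim):
    """Hypothetical helper: subtract each patch's mean score across classes.

    The mean stands in for the part of the score that all classes share.
    """
    cleaned = []
    for row in sim:
        mean = sum(row) / len(row)
        cleaned.append([score - mean for score in row])
    return cleaned


# Toy features (made up): dimension 0 is active in every patch and every
# class embedding, so it adds the same amount to every score.
patches = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]
texts = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

raw = similarity_map(patches, texts)
clean = remove_shared_component(raw)

print(raw)    # [[2.0, 1.0], [1.0, 2.0]]
print(clean)  # [[0.5, -0.5], [-0.5, 0.5]]
```

In the raw scores every patch responds positively to every class because of the shared dimension. After the subtraction, only the matching patch-class pairs remain positive.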
Abstract
Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy…
Taxonomy
Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning
Methods: Contrastive Language-Image Pre-training
