TL;DR
Grad-ECLIP is a gradient-based interpretability method for CLIP. It generates visual and textual explanations that show how image regions and words influence the image-text matching result, supporting both understanding of the model and fine-tuning.
Contribution
The paper presents a gradient-based explanation approach for CLIP that produces high-quality heat maps for interpretability and surpasses previous explanation methods.
Findings
Grad-ECLIP produces effective heat maps for CLIP interpretations.
The method outperforms state-of-the-art explanation techniques.
Analysis reveals insights into CLIP's working mechanism and limitations.
Abstract
Significant progress has been achieved on the improvement and downstream usages of the Contrastive Language-Image Pre-training (CLIP) vision-language model, while less attention has been paid to the interpretation of CLIP. We propose a Gradient-based visual and textual Explanation method for CLIP (Grad-ECLIP), which interprets the matching result of CLIP for a specific input image-text pair. By decomposing the architecture of the encoder and discovering the relationship between the matching similarity and intermediate spatial features, Grad-ECLIP produces effective heat maps that show the influence of image regions or words on the CLIP results. Unlike previous Transformer interpretation methods, which focus on self-attention maps that are typically extremely sparse in CLIP, we produce high-quality visual explanations by applying channel and spatial weights on…
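The general idea of a gradient-weighted heat map over spatial features can be sketched as follows. This is a generic Grad-CAM-style illustration, not the Grad-ECLIP weighting itself, which the truncated abstract above does not fully specify. The toy `similarity` function stands in for CLIP's matching score, gradients are taken by finite differences to keep the example dependency-free, and every function name, shape, and value here is a hypothetical choice made for illustration.

```python
import numpy as np

def similarity(features, text_emb):
    """Toy matching score (assumption): cosine similarity between
    mean-pooled spatial features and a text embedding."""
    pooled = features.mean(axis=0)
    return float(pooled @ text_emb /
                 (np.linalg.norm(pooled) * np.linalg.norm(text_emb)))

def numerical_grad(features, text_emb, eps=1e-5):
    """Central finite-difference gradient of the score w.r.t. each feature entry."""
    grad = np.zeros_like(features)
    for idx in np.ndindex(features.shape):
        plus, minus = features.copy(), features.copy()
        plus[idx] += eps
        minus[idx] -= eps
        grad[idx] = (similarity(plus, text_emb) - similarity(minus, text_emb)) / (2 * eps)
    return grad

def gradient_heat_map(features, text_emb, grid):
    """Grad-CAM-style map: channel weights from gradients, applied to the features."""
    grad = numerical_grad(features, text_emb)
    channel_weights = grad.mean(axis=0)                  # one weight per channel
    heat = np.maximum(features @ channel_weights, 0.0)   # keep positive evidence only
    if heat.max() > 0:
        heat = heat / heat.max()                         # normalize to [0, 1]
    return heat.reshape(grid)

# Hypothetical input: 4 patches (2x2 grid), 4 channels; patch 0 aligns with the text.
features = np.array([[2.0, 0.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])
text_emb = np.array([1.0, 0.0, 0.0, 0.0])
heat = gradient_heat_map(features, text_emb, grid=(2, 2))
print(heat)
```

In this toy setup the patch whose features point in the direction of the text embedding receives the highest value. With a real CLIP model the gradient would typically come from automatic differentiation rather than finite differences.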
