A Closer Look at the Explainability of Contrastive Language-Image Pre-training

Yi Li; Hualiang Wang; Yiqun Duan; Jiheng Zhang; Xiaomeng Li

arXiv:2304.05653 · cs.CV · September 17, 2024 · 42 cites

TL;DR

This paper critically examines CLIP's explainability issues, identifies causes related to architecture and features, and proposes CLIP Surgery, a method that enhances interpretability and extends CLIP's capabilities without additional training.

Contribution

The paper introduces CLIP Surgery, a novel architecture modification technique that improves CLIP's explainability and open-vocabulary performance without fine-tuning.

Findings

01. CLIP tends to focus on background regions in visualizations.

02. Noisy activations are caused by redundant features among categories.

03. CLIP Surgery significantly improves explainability and multimodal visualization.

Abstract

Contrastive language-image pre-training (CLIP) is a powerful vision-language model that has shown great benefits for various tasks. However, we have identified some issues with its explainability, which undermine its credibility and limit the capacity for related tasks. Specifically, we find that CLIP tends to focus on background regions rather than foregrounds, with noisy activations at irrelevant positions on the visualization results. These phenomena conflict with conventional explainability methods based on the class attention map (CAM), where the raw model can highlight the local foreground regions using global supervision without alignment. To address these problems, we take a closer look at its architecture and features. Based on thorough analyses, we find the raw self-attentions link to inconsistent semantic regions, resulting in the opposite visualization. Besides, the noisy…
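
The kind of visualization the abstract refers to can be illustrated with a generic patch-to-text similarity map. The sketch below is not CLIP Surgery and is not taken from the paper; it only shows one common way to turn patch-token embeddings and text embeddings into per-class heatmaps. The function name `similarity_map`, the input shapes, and the random stand-in features are all assumptions made for illustration.

```python
import numpy as np

def similarity_map(patch_feats, text_feats, grid_hw):
    """Cosine similarity between every image patch and every text prompt.

    patch_feats: (num_patches, dim) patch-token embeddings (hypothetical input)
    text_feats:  (num_classes, dim) text embeddings (hypothetical input)
    grid_hw:     (height, width) of the patch grid; height * width == num_patches
    Returns an array of shape (num_classes, height, width) scaled to [0, 1].
    """
    h, w = grid_hw
    # L2-normalize so that the dot product is a cosine similarity.
    p = patch_feats / np.linalg.norm(patch_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    sim = t @ p.T  # (num_classes, num_patches)
    # Min-max scale each class map independently for display.
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / (hi - lo + 1e-8)
    return sim.reshape(-1, h, w)

# Random stand-ins: a 7x7 grid of patch tokens and three class prompts.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))
texts = rng.normal(size=(3, 512))
maps = similarity_map(patches, texts, (7, 7))
print(maps.shape)  # (3, 7, 7)
```

With real features, a map that lights up on the background instead of the named object would correspond to the first issue the abstract describes, and scattered high values at unrelated positions to the second.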

Taxonomy

Topics: Multimodal Machine Learning Applications · COVID-19 diagnosis using AI · Domain Adaptation and Few-Shot Learning

Methods: Contrastive Language-Image Pre-training