Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge

Fawaz Sammani; Nikos Deligiannis

arXiv:2410.13016 · cs.CV · December 19, 2024

TL;DR

This paper investigates how CLIP's zero-shot image classification works by analyzing the shared knowledge between vision and language encoders, using textual concept explanations across multiple models.

Contribution

It introduces a mutual knowledge perspective and textual concept-based explanations to interpret CLIP's zero-shot classification, covering various model architectures and datasets.

Findings

01 Shared concepts influence CLIP's embedding space and classification decisions.

02 Textual explanations effectively clarify CLIP's zero-shot predictions.

03 Analysis across 13 models reveals how mutual knowledge varies with architecture and data.

Abstract

Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representations into a shared embedding space, then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification through the lens of mutual knowledge between the two modalities. Specifically, we ask: what concepts do both vision and language CLIP encoders learn in common that influence the joint embedding space, causing points to be closer or further apart? We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach…
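
The zero-shot step described in the abstract (embed the image and each class description in a shared space, then pick the closest class) can be illustrated with a minimal sketch. This is not the authors' code: the function names `l2_normalize` and `zero_shot_classify` are hypothetical, the 3-dimensional vectors are hand-made stand-ins for real CLIP encoder outputs, and cosine similarity is assumed as the closeness measure.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def zero_shot_classify(image_emb, class_text_embs, class_names):
    """Return the class whose text embedding is closest to the image embedding.

    image_emb:        shape (d,), stand-in for a vision-encoder output
    class_text_embs:  shape (num_classes, d), stand-ins for text-encoder outputs
    """
    image_emb = l2_normalize(np.asarray(image_emb, dtype=np.float64))
    class_text_embs = l2_normalize(np.asarray(class_text_embs, dtype=np.float64))
    sims = class_text_embs @ image_emb  # cosine similarity per class
    best = int(np.argmax(sims))
    return class_names[best], sims


# Toy data (assumption): hand-made vectors, not real CLIP embeddings.
class_names = ["cat", "dog", "car"]
text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.2, 0.9, 0.1])

predicted, sims = zero_shot_classify(image_emb, text_embs, class_names)
print(predicted)  # prints "dog"
```

In a real setting, the two stand-in arrays would come from a CLIP vision encoder and text encoder respectively; the retrieval step itself stays the same.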

Code & Models

Repositories

fawazsammani/clip-interpret-mutual-knowledge
PyTorch · Official

Videos

Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge · SlidesLive

Taxonomy

Topics: Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI · AI in cancer detection

Methods: Contrastive Language-Image Pre-training