Interpreting and Analysing CLIP's Zero-Shot Image Classification via Mutual Knowledge
Fawaz Sammani, Nikos Deligiannis

TL;DR
This paper investigates how CLIP's zero-shot image classification works by analyzing the shared knowledge between vision and language encoders, using textual concept explanations across multiple models.
Contribution
It introduces a mutual knowledge perspective on CLIP's zero-shot classification and interprets predictions with textual concept-based explanations, analysing 13 CLIP models that vary in architecture, size and pretraining dataset.
Findings
Shared concepts influence CLIP's embedding space and classification decisions.
Textual explanations effectively clarify CLIP's zero-shot predictions.
Analysis across 13 models reveals how mutual knowledge varies with architecture and data.
Abstract
Contrastive Language-Image Pretraining (CLIP) performs zero-shot image classification by mapping images and textual class representations into a shared embedding space, then retrieving the class closest to the image. This work provides a new approach for interpreting CLIP models for image classification through the lens of mutual knowledge between the two modalities. Specifically, we ask: what concepts do both the vision and language CLIP encoders learn in common that influence the joint embedding space, causing points to be closer or further apart? We answer this question via an approach of textual concept-based explanations, showing their effectiveness, and perform an analysis encompassing a pool of 13 CLIP models varying in architecture, size and pretraining datasets. We explore those different aspects in relation to mutual knowledge, and analyze zero-shot predictions. Our approach…
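The zero-shot retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the class names, the hand-written 3-dimensional vectors, and the helper names l2_normalize and zero_shot_classify are all hypothetical stand-ins. In a real setting the embeddings would come from a CLIP vision encoder and text encoder; here they are fixed toy values so the example runs with NumPy alone. Cosine similarity is assumed as the closeness measure.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, class_text_embs):
    # image_emb: (d,) embedding of one image.
    # class_text_embs: (num_classes, d), one text embedding per class.
    image_emb = l2_normalize(image_emb)
    class_text_embs = l2_normalize(class_text_embs)
    sims = class_text_embs @ image_emb  # cosine similarity to each class
    return int(np.argmax(sims)), sims   # index of the closest class, all scores

# Toy stand-ins (assumption): real embeddings would come from CLIP encoders.
class_names = ["cat", "dog", "car"]
class_text_embs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
image_emb = np.array([0.2, 0.9, 0.1])  # lies closest to the "dog" direction

pred, sims = zero_shot_classify(image_emb, class_text_embs)
print(class_names[pred])  # dog
```

The predicted class is simply the text embedding with the highest similarity to the image embedding, which is why any concept that moves the two embeddings closer or further apart directly affects the prediction.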
Taxonomy
Topics: Radiomics and Machine Learning in Medical Imaging · COVID-19 diagnosis using AI · AI in cancer detection
Methods: Contrastive Language-Image Pre-training
