TL;DR
EZPC explains CLIP's zero-shot image recognition by projecting its embeddings into a human-understandable concept space, maintaining accuracy and enhancing interpretability without extra supervision.
Contribution
Introduces EZPC, a method that bridges CLIP's predictions with interpretable concepts using a learned projection, without requiring additional concept labels.
Findings
Maintains CLIP's zero-shot accuracy on benchmark datasets.
Provides meaningful concept-level explanations for predictions.
Grounds open-vocabulary predictions in explicit semantic concepts.
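To make "concept-level explanations" concrete, here is a minimal sketch of one way such an explanation could be read out. It assumes (this is not stated in the summary) that the class score is a dot product between an image's concept activations and a class's concept activations, so each concept's contribution is the elementwise product. The function name `top_concepts` and the concept names are hypothetical.

```python
import numpy as np

def top_concepts(image_act, class_act, concept_names, k=3):
    """Rank concepts by their contribution to a dot-product class score.

    Assumption (illustrative only): score = sum_i image_act[i] * class_act[i],
    so concept i contributes image_act[i] * class_act[i].
    """
    image_act = np.asarray(image_act, dtype=float)
    class_act = np.asarray(class_act, dtype=float)
    contrib = image_act * class_act          # per-concept contribution
    order = np.argsort(-contrib)[:k]         # indices of the k largest contributions
    return [(concept_names[i], float(contrib[i])) for i in order]

# Hypothetical activations for one image and one candidate class.
names = ["striped", "furry", "metallic"]
explanation = top_concepts([0.9, 0.1, 0.5], [1.0, 0.2, 0.0], names, k=2)
```

Because the contributions sum to the assumed score, a ranking of this kind accounts for the whole prediction rather than being a separate post-hoc estimate.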
Abstract
Large-scale vision-language models such as CLIP have achieved remarkable success in zero-shot image recognition, yet their predictions remain largely opaque to human understanding. In contrast, Concept Bottleneck Models provide interpretable intermediate representations by reasoning through human-defined concepts, but they rely on concept supervision and cannot generalize to unseen classes. We introduce EZPC, which bridges these two paradigms by explaining CLIP's zero-shot predictions through human-understandable concepts. Our method projects CLIP's joint image-text embeddings into a concept space learned from language descriptions, enabling faithful and transparent explanations without additional supervision. The model learns this projection via a combination of alignment and reconstruction objectives, ensuring that concept activations preserve CLIP's semantic structure…
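The abstract names alignment and reconstruction objectives but does not give their exact form. The sketch below shows one plausible reading, under stated assumptions: a linear projection `W` (hypothetical) maps CLIP embeddings to concept activations, the alignment term asks concept-space similarities to match CLIP's own image-class similarities, and the reconstruction term asks the activations, combined with concept text embeddings, to recover the original image embedding. The actual EZPC losses may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def concept_losses(image_emb, class_text_emb, concept_text_emb, W):
    """Illustrative alignment and reconstruction losses (not the paper's exact form).

    image_emb:        (n, d) CLIP image embeddings, assumed unit-norm
    class_text_emb:   (k, d) CLIP text embeddings of class names, assumed unit-norm
    concept_text_emb: (m, d) CLIP text embeddings of concept descriptions
    W:                (d, m) hypothetical learned projection into concept space
    """
    img_c = image_emb @ W          # (n, m) concept activations for images
    cls_c = class_text_emb @ W     # (k, m) concept activations for classes

    # Alignment: cosine similarities in concept space should match CLIP's.
    clip_logits = image_emb @ class_text_emb.T
    concept_logits = l2_normalize(img_c) @ l2_normalize(cls_c).T
    align = np.mean((clip_logits - concept_logits) ** 2)

    # Reconstruction: activations weighted over concept embeddings
    # should recover the original image embedding.
    recon = img_c @ concept_text_emb   # (n, d)
    rec = np.mean((recon - image_emb) ** 2)
    return align, rec

# Toy check with random stand-ins for CLIP embeddings (no real model is loaded).
rng = np.random.default_rng(0)
d = 4
img = l2_normalize(rng.normal(size=(3, d)))
cls = l2_normalize(rng.normal(size=(2, d)))
concepts = np.eye(d)               # idealized orthonormal concept embeddings
align_id, rec_id = concept_losses(img, cls, concepts, concepts.T)
align_rand, rec_rand = concept_losses(img, cls, concepts, rng.normal(size=(d, d)))
```

With orthonormal concept embeddings and `W` set to their transpose, both terms vanish, which illustrates the sense in which a good projection "preserves CLIP's semantic structure"; a random `W` leaves both terms positive.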
