No Token Left Behind: Explainability-Aided Image Classification and Generation
Roni Paiss, Hila Chefer, Lior Wolf

TL;DR
This paper introduces an explainability-based method to improve CLIP's stability and accuracy in zero-shot image classification and generation, addressing prompt sensitivity and enabling spatially conditioned image synthesis.
Contribution
It proposes a novel explainability-driven loss to ensure CLIP attends to all relevant semantic parts, enhancing zero-shot classification and guided image generation without additional training.
Findings
Improved recognition rates in one-shot classification.
Enhanced quality of generated images with CLIP guidance.
Enabled spatially conditioned image generation using explainability heatmaps.
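To make the last finding concrete, here is a minimal sketch of how an explainability heatmap could be turned into a spatial conditioning signal. It is an illustration under stated assumptions, not the paper's formulation: the function name `spatial_loss`, the binary target mask, and the choice of "fraction of relevance outside the target region" as the penalty are all hypothetical.

```python
from typing import Sequence


def spatial_loss(heatmap: Sequence[Sequence[float]],
                 mask: Sequence[Sequence[int]]) -> float:
    """Fraction of heatmap relevance that falls outside the target region.

    heatmap: non-negative relevance per pixel (assumed to come from some
             explainability method; how it is computed is out of scope here).
    mask:    1 where the described object should appear, 0 elsewhere.
    Returns a value in [0, 1]; 0 means all relevance lies inside the mask.
    """
    inside = 0.0
    outside = 0.0
    for h_row, m_row in zip(heatmap, mask):
        for h, m in zip(h_row, m_row):
            if m:
                inside += h
            else:
                outside += h
    total = inside + outside
    if total == 0.0:
        # No relevance anywhere: treat as the worst case (hypothetical choice).
        return 1.0
    return outside / total
```

In a guided-generation loop, a term like this would be added to the text-image similarity objective so that the generator is pushed to place the described content inside the requested region.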
Abstract
The application of zero-shot learning in computer vision has been revolutionized by the use of image-text matching models. The most notable example, CLIP, has been widely used for both zero-shot classification and guiding generative models with a text prompt. However, the zero-shot use of CLIP is unstable with respect to the phrasing of the input text, making it necessary to carefully engineer the prompts used. We find that this instability stems from a selective similarity score, which is based only on a subset of the semantically meaningful input tokens. To mitigate it, we present a novel explainability-based approach, which adds a loss term to ensure that CLIP focuses on all relevant semantic parts of the input, in addition to employing the CLIP similarity loss used in previous works. When applied to one-shot classification through prompt engineering, our method yields an improvement…
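The abstract describes the objective as the usual CLIP similarity loss plus an added explainability-based term. The sketch below shows one way such a combination could look. The names `combined_loss`, `token_relevance`, `semantic_mask`, and the weight `lam` are assumptions made for illustration, as is the specific penalty (one minus the mean relevance of the semantic tokens); the paper's actual loss may differ.

```python
from typing import Sequence


def combined_loss(similarity: float,
                  token_relevance: Sequence[float],
                  semantic_mask: Sequence[bool],
                  lam: float = 1.0) -> float:
    """Similarity loss plus a penalty for under-attended semantic tokens.

    similarity:      image-text cosine similarity, assumed to lie in [-1, 1].
    token_relevance: per-token relevance scores, assumed normalized to [0, 1].
    semantic_mask:   True for tokens considered semantically meaningful.
    lam:             weight of the explainability term (hypothetical).
    """
    if len(token_relevance) != len(semantic_mask):
        raise ValueError("token_relevance and semantic_mask must align")

    # Standard term: higher similarity gives lower loss.
    similarity_loss = 1.0 - similarity

    # Explainability term: only the semantic tokens contribute.
    semantic_scores = [r for r, keep in zip(token_relevance, semantic_mask) if keep]
    if not semantic_scores:
        return similarity_loss
    mean_relevance = sum(semantic_scores) / len(semantic_scores)
    explainability_loss = 1.0 - mean_relevance

    return similarity_loss + lam * explainability_loss
```

With the same similarity score, a prompt whose semantic tokens all receive high relevance yields a lower loss than one where a token is ignored, which is the behavior the added term is meant to encourage.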
Taxonomy
Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Explainable Artificial Intelligence (XAI)
Methods: Contrastive Language-Image Pre-training · Heatmap
