Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model
Shiming Chen, Bowen Duan, Salman Khan, Fahad Shahbaz Khan

TL;DR
This paper introduces LaZSL, a locally-aligned vision-language model that enhances interpretability and accuracy in zero-shot learning by aligning visual regions with attributes using optimal transport, without extra training.
Contribution
The authors present LaZSL as the first model to align local visual features with attributes via optimal transport for interpretable zero-shot learning, improving both interpretability and accuracy.
Findings
Enhanced interpretability of ZSL predictions.
Improved accuracy over baseline models.
Strong domain generalization demonstrated.
Abstract
Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and…
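The local alignment step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the visible abstract does not give LaZSL's exact formulation, so the example uses generic entropy-regularised optimal transport (Sinkhorn iterations). The uniform marginals, the cosine-distance cost, the `eps` value, the helper names (`sinkhorn`, `class_score`), and the random vectors standing in for VLM region and attribute embeddings are all assumptions made for illustration.

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan between histograms a (n,) and b (m,).

    Generic Sinkhorn iterations; an assumed stand-in for whatever
    OT solver the paper actually uses.
    """
    K = np.exp(-cost / eps)            # Gibbs kernel, shape (n, m)
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)              # rescale to match column marginals
        u = a / (K @ v)                # rescale to match row marginals
    return u[:, None] * K * v[None, :]

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def class_score(regions, attributes, eps=0.1):
    """Score one class by transporting image regions onto its attributes."""
    r = l2_normalize(regions)          # (n_regions, dim)
    t = l2_normalize(attributes)       # (n_attributes, dim)
    sim = r @ t.T                      # cosine similarity per region/attribute pair
    cost = 1.0 - sim                   # assumed cost: cosine distance
    a = np.full(r.shape[0], 1.0 / r.shape[0])   # assumed uniform region weights
    b = np.full(t.shape[0], 1.0 / t.shape[0])   # assumed uniform attribute weights
    plan = sinkhorn(cost, a, b, eps)
    return float((plan * sim).sum()), plan

# Random vectors stand in for embeddings of image regions and attribute texts.
rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 16))
attrs_by_class = {
    "class_a": rng.normal(size=(4, 16)),   # hypothetical class with 4 attributes
    "class_b": rng.normal(size=(5, 16)),   # hypothetical class with 5 attributes
}

scores, plans = {}, {}
for name, attrs in attrs_by_class.items():
    scores[name], plans[name] = class_score(regions, attrs)

prediction = max(scores, key=scores.get)
```

Under these assumptions, each entry `plans[name][i, j]` indicates how much of region `i` is matched to attribute `j`, which is the kind of region-to-attribute evidence that a whole-image similarity score cannot provide.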
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)
