Interpretable Zero-Shot Learning with Locally-Aligned Vision-Language Model

Shiming Chen; Bowen Duan; Salman Khan; Fahad Shahbaz Khan

arXiv:2506.23822 · cs.CV · July 1, 2025

TL;DR

This paper introduces LaZSL, a locally-aligned vision-language model that enhances interpretability and accuracy in zero-shot learning by aligning visual regions with attributes using optimal transport, without extra training.

Contribution

LaZSL is the first model to align local visual features with attributes via optimal transport for interpretable zero-shot learning, improving both interpretability and performance.

Findings

1. Enhanced interpretability of ZSL predictions.
2. Improved accuracy over baseline models.
3. Strong domain generalization demonstrated.

Abstract

Large-scale vision-language models (VLMs), such as CLIP, have achieved remarkable success in zero-shot learning (ZSL) by leveraging large-scale visual-text pair datasets. However, these methods often lack interpretability, as they compute the similarity between an entire query image and the embedded category words, making it difficult to explain their predictions. One approach to address this issue is to develop interpretable models by integrating language, where classifiers are built using discrete attributes, similar to human perception. This introduces a new challenge: how to effectively align local visual features with corresponding attributes based on pre-trained VLMs. To tackle this, we propose LaZSL, a locally-aligned vision-language model for interpretable ZSL. LaZSL employs local visual-semantic alignment via optimal transport to perform interaction between visual regions and…
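The region-to-attribute alignment described in the abstract can be illustrated with a small optimal transport example. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract is truncated here and does not give the exact formulation, so the cosine-distance cost, the uniform marginals, the entropy-regularised Sinkhorn solver, and the names `sinkhorn`, `ot_class_score`, `eps`, and `n_iters` are all illustrative choices. Random vectors stand in for the region and attribute embeddings that a pre-trained VLM such as CLIP would provide.

```python
import numpy as np


def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised OT plan with row marginals a and column marginals b."""
    K = np.exp(-cost / eps)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)                # rescale to match column marginals
        u = a / (K @ v)                  # rescale to match row marginals
    return u[:, None] * K * v[None, :]


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def ot_class_score(regions, attributes, eps=0.1):
    """Score one class by transporting region features onto its attributes.

    regions:    (R, D) local visual features (assumed given)
    attributes: (A, D) text embeddings of the class attributes (assumed given)
    """
    sim = l2_normalize(regions) @ l2_normalize(attributes).T   # (R, A) cosine
    cost = 1.0 - sim                                           # assumed cost
    a = np.full(sim.shape[0], 1.0 / sim.shape[0])              # uniform mass
    b = np.full(sim.shape[1], 1.0 / sim.shape[1])
    plan = sinkhorn(cost, a, b, eps)
    # Plan-weighted similarity; the plan itself shows which region was
    # matched to which attribute, which is the interpretable part.
    return float((plan * sim).sum()), plan


# Stand-in data: 6 image regions and two hypothetical classes.
rng = np.random.default_rng(0)
regions = rng.normal(size=(6, 32))
class_attributes = {
    "class_a": rng.normal(size=(4, 32)),
    "class_b": rng.normal(size=(5, 32)),
}

results = {name: ot_class_score(regions, attrs)
           for name, attrs in class_attributes.items()}
scores = {name: score for name, (score, _) in results.items()}
prediction = max(scores, key=scores.get)
```

In this sketch the predicted class is the one with the highest transport-weighted similarity, and each class's plan matrix can be inspected row by row to see how much of a region's mass went to each attribute. No training step is involved, which matches the training-free setting the TL;DR describes, though the real method may differ in how regions are selected and how the cost is defined.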

Code & Models

Models

KimingChen/LaZSL (Hugging Face model)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Explainable Artificial Intelligence (XAI)