DiSa: Directional Saliency-Aware Prompt Learning for Generalizable Vision-Language Models

Niloufar Alipour Talemi; Hossein Kashiani; Hossein R. Nowdeh; Fatemeh Afghah

arXiv:2505.19373 · cs.CV · May 27, 2025

Open Access

TL;DR

DiSa is a prompt learning framework that improves the generalization of vision-language models by combining saliency-aware masking with directional regularization, with reported gains across diverse image classification tasks.

Contribution

The paper proposes DiSa, a new prompt learning method combining saliency-guided masking and directional embedding regularization to improve generalization in vision-language models.

Findings

01. Outperforms state-of-the-art methods on 11 benchmarks
02. Effective in base-to-novel generalization, cross-dataset transfer, domain generalization, and few-shot settings
03. Enhances cross-modal alignment and feature robustness

Abstract

Prompt learning has emerged as a powerful paradigm for adapting vision-language models such as CLIP to downstream tasks. However, existing methods often overfit to seen data, leading to significant performance degradation when generalizing to novel classes or unseen domains. To address this limitation, we propose DiSa, a Directional Saliency-Aware Prompt Learning framework that integrates two complementary regularization strategies to enhance generalization. First, our Cross-Interactive Regularization (CIR) fosters cross-modal alignment by enabling cooperative learning between prompted and frozen encoders. Within CIR, a saliency-aware masking strategy guides the image encoder to prioritize semantically critical image regions, reducing reliance on less informative patches. Second, we introduce a directional regularization strategy that aligns visual embeddings with class-wise prototype…
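
The two ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the top-k keep rule, and the `keep_ratio` parameter are hypothetical, and because the abstract is cut off before it describes the directional term, the sketch assumes "directional" means angular agreement measured by cosine similarity between an embedding and a class prototype.

```python
import math

def top_k_saliency_mask(saliency, keep_ratio=0.5):
    """Hypothetical helper: return a 0/1 mask that keeps the most salient patches.

    `saliency` is a list of per-patch scores; how those scores are computed
    is left open here and is not taken from the paper.
    """
    k = max(1, math.ceil(len(saliency) * keep_ratio))
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    keep = set(ranked[:k])
    return [1 if i in keep else 0 for i in range(len(saliency))]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_loss(embedding, prototype):
    """Assumed form of a directional penalty: 1 - cosine similarity.

    It depends only on the angle between the vectors, not on their lengths.
    """
    return 1.0 - cosine(embedding, prototype)

# Keep the two most salient of four patches.
mask = top_k_saliency_mask([0.1, 0.9, 0.4, 0.7], keep_ratio=0.5)

# Same direction, different magnitude: no penalty under this assumed loss.
aligned = directional_loss([2.0, 0.0], [1.0, 0.0])

# Orthogonal directions: maximal penalty among non-opposed vectors.
orthogonal = directional_loss([1.0, 0.0], [0.0, 1.0])
```

Under these assumptions, the mask decides which patches the image encoder sees, while the loss penalizes only a mismatch in orientation, so an embedding that is a scaled copy of its prototype incurs no cost.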

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Language-Image Pre-training