TL;DR
VisPrompt is a vision-guided prompt learning framework that improves the robustness of vision-language models under label noise by combining cross-modal attention with adaptive injection of visual information.
Contribution
It introduces a lightweight, robust prompt learning method that uses visual semantics and adaptive modulation to mitigate label noise effects.
Findings
Outperforms existing methods on seven benchmark datasets.
Effectively suppresses noise-induced disturbances and reduces instability.
Preserves the pretrained backbone while adding only minimal parameters.
Abstract
Prompt learning is a parameter-efficient approach for adapting vision-language models, yet its robustness under label noise remains underexplored. Visual content carries richer and more reliable semantic information and stays comparatively robust under label noise, whereas the prompt itself is highly susceptible to it. Motivated by this observation, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we use a cross-modal attention mechanism to inject visual semantics back into the prompt representations. This lets the prompt tokens selectively aggregate visual information relevant to the current sample, anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of…
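The injection step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes prompt tokens act as attention queries over a sample's visual tokens, that no learned projection matrices are used, and that the attended visual summary is added back to each prompt through a fixed scalar `gate`. The names `cross_attend`, `inject`, and `gate` are hypothetical, and the scalar gate is only a stand-in for the adaptive modulation the summary mentions.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(prompts, visual):
    """Scaled dot-product attention with prompts as queries.

    prompts: list of n_p vectors of length d (hypothetical prompt tokens)
    visual:  list of n_v vectors of length d (hypothetical visual tokens)
    Returns (aggregated, weights): one visual summary per prompt token and
    the attention weights over the visual tokens.
    """
    d = len(prompts[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for p in prompts:
        scores = [scale * sum(pi * vi for pi, vi in zip(p, v)) for v in visual]
        weights.append(softmax(scores))
    aggregated = [
        [sum(w * v[j] for w, v in zip(w_row, visual)) for j in range(d)]
        for w_row in weights
    ]
    return aggregated, weights


def inject(prompts, visual, gate=0.5):
    """Add the attended visual summary to each prompt token.

    `gate` is an assumed fixed scalar; the paper's adaptive modulation is
    not specified in this excerpt and is not reproduced here.
    """
    aggregated, weights = cross_attend(prompts, visual)
    updated = [
        [p_j + gate * a_j for p_j, a_j in zip(p, a)]
        for p, a in zip(prompts, aggregated)
    ]
    return updated, weights


# Toy example: two prompt tokens, three visual tokens, dimension 2.
prompts = [[1.0, 0.0], [0.0, 1.0]]
visual = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
updated, weights = inject(prompts, visual, gate=0.5)
```

In this toy setup each prompt token puts most of its attention on the visual token it is best aligned with, and setting `gate=0.0` leaves the prompts unchanged, which shows how a gate could limit how much visual evidence is injected.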
