Seeing is Believing: Robust Vision-Guided Cross-Modal Prompt Learning under Label Noise

Zibin Geng; Xuefeng Jiang; Jia Li; Zheng Li; Tian Wen; Lvhua Wu; Sheng Sun; Yuwei Wang; Min Liu

arXiv:2604.09532·cs.CV·April 13, 2026

TL;DR

VisPrompt is a vision-guided prompt learning framework that makes vision-language models more robust to label noise by combining cross-modal attention with adaptive injection of visual information into the prompts.

Contribution

The paper introduces a lightweight, robust prompt learning method that uses visual semantics and adaptive modulation to mitigate the effects of label noise.

Findings

01 Outperforms existing methods on seven benchmark datasets.
02 Effectively suppresses noise-induced disturbances and reduces instability.
03 Maintains pretrained backbone with minimal additional parameters.

Abstract

Prompt learning is a parameter-efficient approach for adapting vision-language models, yet its robustness under label noise remains underexplored. Visual content carries richer and more reliable semantic information and stays stable under label noise, whereas the prompt itself is highly susceptible to it. Motivated by this intuition, we propose VisPrompt, a lightweight and robust vision-guided prompt learning framework for noisy-label settings. Specifically, we use a cross-modal attention mechanism to inject visual semantics back into the prompt representations. This enables the prompt tokens to selectively aggregate visual information relevant to the current sample, thereby improving robustness by anchoring prompt learning to stable instance-level visual evidence and reducing the influence of noisy supervision. To address the instability caused by using the same way of…
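The cross-modal attention step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `inject_visual_semantics`, the projection matrices, the tensor shapes, and the fixed scalar `gate` are all assumptions made for illustration, and the scalar gate merely stands in for whatever adaptive modulation the paper actually uses. The only idea taken from the abstract is that prompt tokens act as queries over visual features, so each token aggregates the visual information most relevant to the current sample.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def inject_visual_semantics(prompts, patches, w_q, w_k, w_v, gate=0.5):
    """Hypothetical single-head cross-attention from prompts to image patches.

    prompts: (M, D) learnable prompt token embeddings (queries)
    patches: (N, D) visual features of one image (keys and values)
    gate:    assumed fixed scalar controlling how much visual context is added
    """
    q = prompts @ w_q
    k = patches @ w_k
    v = patches @ w_v
    # Each prompt token gets a distribution over the N visual features.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (M, N)
    visual_context = attn @ v                                 # (M, D)
    # Residual update: the original prompt plus gated visual evidence.
    return prompts + gate * visual_context, attn


rng = np.random.default_rng(0)
M, N, D = 4, 49, 16  # illustrative sizes, not taken from the paper
prompts = rng.normal(size=(M, D))
patches = rng.normal(size=(N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

updated, attn = inject_visual_semantics(prompts, patches, w_q, w_k, w_v)
print(updated.shape, attn.shape)
```

In this sketch, setting `gate=0` leaves the prompts unchanged, which makes the visual contribution easy to ablate.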

Code & Models

Repositories

gezbww/Vis_Prompt (GitHub)
