CAPT: Confusion-Aware Prompt Tuning for Reducing Vision-Language Misalignment

Maoyuan Shao; Yutong Gao; Xinyang Huang; Chuang Zhu; Lijuan Sun; Guoshun Nan

arXiv:2603.02557 · cs.CV · March 4, 2026

TL;DR

CAPT introduces a confusion-aware prompt tuning framework for vision-language models, explicitly modeling and reducing class confusion to improve discriminability and generalization across multiple datasets.

Contribution

The paper proposes a novel framework with a confusion bank, semantic and sample confusion miners, and a multi-granularity expert to address systematic misclassifications in vision-language models.

Findings

1. Significantly reduces confusion-induced errors.
2. Improves discriminability and generalization on 11 datasets.
3. Resolves over 50% of confusable sample pairs.

Abstract

Vision-language models like CLIP have achieved remarkable progress in cross-modal representation learning, yet suffer from systematic misclassifications among visually and semantically similar categories. We observe that such confusion patterns are not random but persistently occur between specific category pairs, revealing the model's intrinsic bias and limited fine-grained discriminative ability. To address this, we propose CAPT, a Confusion-Aware Prompt Tuning framework that enables models to learn from their own misalignment. Specifically, we construct a Confusion Bank to explicitly model stable confusion relationships across categories and misclassified samples. On this basis, we introduce a Semantic Confusion Miner (SEM) to capture global inter-class confusion through semantic difference and commonality prompts, and a Sample Confusion Miner (SAM) to retrieve representative…
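
To make the idea of a Confusion Bank more concrete, here is a minimal Python sketch of one way such a structure might be assembled from a model's predictions. It is not the authors' implementation: the function name `build_confusion_bank`, the `min_count` threshold, and the toy labels are all hypothetical, and the summary above does not say how CAPT decides that a confusion relationship is "stable". The sketch only assumes that a pair counts as stable when the same (true class, predicted class) error recurs.

```python
from collections import Counter, defaultdict


def build_confusion_bank(labels, preds, min_count=2):
    """Map each recurring (true, predicted) class pair to its misclassified sample indices.

    Hypothetical helper: `min_count` is an assumed stand-in for whatever
    stability criterion the paper actually uses.
    """
    pair_counts = Counter()
    pair_samples = defaultdict(list)
    for idx, (true_cls, pred_cls) in enumerate(zip(labels, preds)):
        if true_cls != pred_cls:
            pair_counts[(true_cls, pred_cls)] += 1
            pair_samples[(true_cls, pred_cls)].append(idx)
    # Keep only pairs that are confused at least `min_count` times.
    return {
        pair: pair_samples[pair]
        for pair, count in pair_counts.items()
        if count >= min_count
    }


# Toy data (invented for illustration only).
labels = ["cat", "cat", "dog", "lynx", "cat", "dog"]
preds = ["lynx", "lynx", "dog", "cat", "dog", "dog"]

bank = build_confusion_bank(labels, preds)
print(bank)  # {('cat', 'lynx'): [0, 1]}
```

In this toy run only the cat-to-lynx error recurs, so it is the only pair retained; one-off mistakes are filtered out. A structure like this records both the confused category pair and the samples behind it, which matches the two levels (categories and misclassified samples) that the abstract attributes to the Confusion Bank.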

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling