Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval

Yifan Wang; Tao Wang; Chenwei Tang; Caiyang Yu; Zhengqing Zang; Mengmi Zhang; Shudong Huang; Jiancheng Lv

arXiv:2508.04028 · cs.CV · August 7, 2025

TL;DR

This paper introduces DCAR, a dual prompt learning framework that enhances vision-language models for image-text retrieval by dynamically adjusting prompts to better discriminate fine-grained attributes and subcategories.

Contribution

The paper proposes a novel dual prompt learning approach with joint category-attribute reweighting to improve fine-grained image-text matching in downstream retrieval tasks.

Findings

1. DCAR achieves state-of-the-art results on the FDRD benchmark.
2. Dynamic prompt adjustment improves fine-grained attribute discrimination.
3. Joint optimization of attribute and category features enhances retrieval accuracy.

Abstract

Recently, prompt learning has demonstrated remarkable success in adapting pre-trained Vision-Language Models (VLMs) to various downstream tasks such as image classification. However, its application to the downstream Image-Text Retrieval (ITR) task is more challenging. We find that the challenge lies in discriminating both fine-grained attributes and similar subcategories of the downstream data. To address this challenge, we propose Dual prompt Learning with Joint Category-Attribute Reweighting (DCAR), a novel dual-prompt learning framework to achieve precise image-text matching. The framework dynamically adjusts prompt vectors from both semantic and visual dimensions to improve the performance of CLIP on the downstream ITR task. Based on the prompt paradigm, DCAR jointly optimizes attribute and class features to enhance fine-grained representation learning. Specifically, (1) at the…
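
For readers new to the ITR setting, the sketch below shows how image-text matching is commonly scored with CLIP-style embeddings: cosine similarity between image and text vectors, then Recall@K in both directions. It is not DCAR itself. The embeddings are random stand-ins for encoder outputs, and the helper names (`l2_normalize`, `similarity_matrix`, `recall_at_k`) are hypothetical, chosen only for illustration. It also assumes one matching caption per image, stored at the same index.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so a dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def similarity_matrix(image_emb, text_emb):
    """Return sim[i, j] = cosine similarity between image i and text j."""
    return l2_normalize(image_emb) @ l2_normalize(text_emb).T


def recall_at_k(sim, k):
    """Fraction of queries whose true match (same index) is in the top-k candidates.

    Rows of `sim` are queries and columns are candidates.
    """
    topk = np.argsort(-sim, axis=1)[:, :k]      # indices of the k best candidates per query
    targets = np.arange(sim.shape[0])[:, None]  # assumed ground truth: query i matches candidate i
    return float(np.mean(np.any(topk == targets, axis=1)))


# Stand-in embeddings (assumption): in practice these would come from the
# image and text encoders of a CLIP-like model, with or without learned prompts.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(8, 16))
image_emb = text_emb + 0.1 * rng.normal(size=(8, 16))  # noisy copies of the paired texts

sim = similarity_matrix(image_emb, text_emb)
i2t_r1 = recall_at_k(sim, k=1)    # image-to-text retrieval
t2i_r1 = recall_at_k(sim.T, k=1)  # text-to-image retrieval
print(f"I2T R@1 = {i2t_r1:.2f}, T2I R@1 = {t2i_r1:.2f}")
```

Under this framing, a prompt-learning method changes only how the embeddings are produced; the similarity and recall computation stays the same, so it can be reused to compare an adapted model against a baseline.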
