Dude: Dual Distribution-Aware Context Prompt Learning For Large Vision-Language Model

Duy M. H. Nguyen; An T. Le; Trung Q. Nguyen; Nghiem T. Diep; Tai Nguyen; Duy Duong-Tran; Jan Peters; Li Shen; Mathias Niepert; Daniel Sonntag

arXiv:2407.04489 · cs.CV · July 8, 2024

TL;DR

This paper introduces Dude, a dual distribution-aware prompt learning framework for large vision-language models that leverages domain-shared and class-specific contexts, enhanced by optimal transport theory, to improve fine-grained classification performance.

Contribution

The paper proposes a novel dual prompt framework combined with Unbalanced Optimal Transport to better align visual tokens and prompts, addressing limitations of existing prompt methods in fine-grained tasks.
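For orientation, a commonly used form of the unbalanced optimal transport problem is written out below. It is a generic formulation with KL-relaxed marginals, not necessarily the exact objective used in the paper. All symbols are assumptions introduced here: $C$ is a cost matrix between $M$ visual tokens and $N$ prompts, $T$ is the transport plan, $a$ and $b$ are the token and prompt mass vectors, and $\rho_1, \rho_2 > 0$ control how strictly the marginals are enforced.

```latex
\min_{T \in \mathbb{R}_{\ge 0}^{M \times N}}
  \langle T, C \rangle
  + \rho_1 \, \mathrm{KL}\!\left( T \mathbf{1}_N \,\|\, a \right)
  + \rho_2 \, \mathrm{KL}\!\left( T^{\top} \mathbf{1}_M \,\|\, b \right)
```

Because the marginal constraints are penalized rather than enforced exactly, the plan may leave part of the mass untransported, which is what makes partial matching between tokens and prompts possible. Letting $\rho_1, \rho_2 \to \infty$ recovers the balanced problem with hard constraints $T \mathbf{1}_N = a$ and $T^{\top} \mathbf{1}_M = b$.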

Findings

01. Outperforms state-of-the-art baselines in few-shot classification.

02. Effectively handles noisy and irrelevant elements through UOT-based partial matching.

03. Enhances feature representation with dual context prompts and optimal transport.

Abstract

Prompt learning methods are gaining increasing attention due to their ability to customize large vision-language models to new domains using pre-trained contextual knowledge and minimal training data. However, existing works typically rely on optimizing unified prompt inputs, often struggling with fine-grained classification tasks due to insufficient discriminative attributes. To tackle this, we consider a new framework based on a dual context of both domain-shared and class-specific contexts, where the latter is generated by Large Language Models (LLMs) such as GPTs. Such dual prompt methods enhance the model's feature representation by joining implicit and explicit factors encoded in LLM knowledge. Moreover, we formulate the Unbalanced Optimal Transport (UOT) theory to quantify the relationships between constructed prompts and visual tokens. Through partial matching, UOT can properly…
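To make the idea of matching prompts to visual tokens with UOT more concrete, here is a minimal sketch using entropic unbalanced Sinkhorn iterations with KL-relaxed marginals. It is an illustration under stated assumptions, not the authors' implementation: the function names `cosine_cost` and `unbalanced_sinkhorn`, the cosine-distance cost, the uniform mass vectors, the parameters `eps` and `rho`, and the random stand-in features are all hypothetical choices made here.

```python
import numpy as np


def cosine_cost(visual_tokens, prompt_embeds):
    """Cost matrix C[i, j] = 1 - cosine similarity(token i, prompt j)."""
    v = visual_tokens / np.linalg.norm(visual_tokens, axis=1, keepdims=True)
    p = prompt_embeds / np.linalg.norm(prompt_embeds, axis=1, keepdims=True)
    return 1.0 - v @ p.T


def unbalanced_sinkhorn(C, a, b, eps=0.1, rho=1.0, n_iters=200):
    """Entropic UOT with KL marginal penalties (generic scaling iterations).

    eps is the entropic regularization strength; rho is the marginal
    relaxation weight. Both are assumed hyperparameters for this sketch.
    """
    K = np.exp(-C / eps)           # Gibbs kernel
    fi = rho / (rho + eps)         # exponent that softens the marginal updates
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = (a / (K @ v)) ** fi
        v = (b / (K.T @ u)) ** fi
    return u[:, None] * K * v[None, :]   # transport plan T


# Random stand-ins for encoder outputs (hypothetical sizes).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(49, 64))    # e.g. a 7x7 grid of visual tokens
prompts = rng.normal(size=(4, 64))    # e.g. 4 prompt embeddings for one class

C = cosine_cost(tokens, prompts)
a = np.full(tokens.shape[0], 1.0 / tokens.shape[0])    # uniform token mass
b = np.full(prompts.shape[0], 1.0 / prompts.shape[0])  # uniform prompt mass

T = unbalanced_sinkhorn(C, a, b)
transport_cost = float((T * C).sum())  # lower cost = better token-prompt match
```

In this sketch, a small `rho` lets the plan deviate from the prescribed marginals, so tokens that are costly to match against every prompt can receive less mass, while a large `rho` approaches balanced optimal transport. A per-class `transport_cost` of this kind could then serve as a classification score, though how the paper turns the plan into logits is not stated in the text above.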

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Softmax · Attention Is All You Need · Adapter · ALIGN