HPT++: Hierarchically Prompting Vision-Language Models with Multi-Granularity Knowledge Generation and Improved Structure Modeling

Yubin Wang, Xinyang Jiang, De Cheng, Wenli Sun, Dongsheng Li, Cairong Zhao

arXiv:2408.14812 · cs.CV · August 28, 2024

TL;DR

HPT++ introduces a hierarchical prompt tuning approach that combines LLM-generated structured knowledge with multi-level modeling to adapt vision-language models, and it outperforms existing methods across various evaluation settings.

Contribution

The paper proposes HPT++, a novel hierarchical prompt tuning framework that explicitly models structured knowledge and multi-granularity information for improved vision-language understanding.

Findings

01. HPT++ outperforms state-of-the-art methods in multiple evaluation scenarios.
02. Incorporating structured knowledge improves prompt effectiveness.
03. Hierarchical modeling captures complex relationships better than flat approaches.

Abstract

Prompt learning has become a prevalent strategy for adapting vision-language foundation models (VLMs) such as CLIP to downstream tasks. With the emergence of large language models (LLMs), recent studies have explored the potential of using category-related descriptions to enhance prompt effectiveness. However, conventional descriptions lack explicit structured information necessary to represent the interconnections among key elements like entities or attributes with relation to a particular category. Since existing prompt tuning methods give little consideration to managing structured knowledge, this paper advocates leveraging LLMs to construct a graph for each description to prioritize such structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), enabling simultaneous modeling of both structured and conventional linguistic knowledge.…
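
The abstract says HPT++ uses an LLM to turn each category description into a graph over entities and attributes, but the abstract as reproduced here does not describe how that graph is consumed. The sketch below assumes one plausible scheme: store the graph as nodes plus (head, relation, tail) edges, convert it to an adjacency matrix, and add that matrix as a bias to attention logits before the softmax so that related nodes attend more strongly to each other. `DescriptionGraph`, `relation_biased_attention`, the additive `bias` parameter, and the goldfinch example graph are all hypothetical illustrations, not the paper's released implementation.

```python
import math
from dataclasses import dataclass


@dataclass
class DescriptionGraph:
    """Hypothetical container for structured knowledge extracted from one
    category description: entities/attributes become nodes and relations
    become (head, relation, tail) edges."""
    nodes: list[str]
    edges: list[tuple[str, str, str]]

    def adjacency(self) -> list[list[float]]:
        """Return a symmetric 0/1 adjacency matrix with self-loops."""
        index = {name: i for i, name in enumerate(self.nodes)}
        n = len(self.nodes)
        adj = [[0.0] * n for _ in range(n)]
        for i in range(n):
            adj[i][i] = 1.0  # every node stays connected to itself
        for head, _relation, tail in self.edges:
            h, t = index[head], index[tail]
            adj[h][t] = adj[t][h] = 1.0  # relation direction is ignored here
        return adj


def softmax(row: list[float]) -> list[float]:
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_biased_attention(
    scores: list[list[float]], adj: list[list[float]], bias: float = 1.0
) -> list[list[float]]:
    """Add `bias` to the attention logit of every graph-linked node pair,
    then normalize each row with softmax (assumed scheme, for illustration)."""
    return [
        softmax([s + bias * a for s, a in zip(score_row, adj_row)])
        for score_row, adj_row in zip(scores, adj)
    ]


if __name__ == "__main__":
    # Hand-written stand-in for an LLM-generated graph of one description.
    graph = DescriptionGraph(
        nodes=["goldfinch", "yellow body", "black wings", "conical beak"],
        edges=[
            ("goldfinch", "has", "yellow body"),
            ("goldfinch", "has", "black wings"),
            ("goldfinch", "has", "conical beak"),
        ],
    )
    n = len(graph.nodes)
    uniform_scores = [[0.0] * n for _ in range(n)]  # no content-based preference
    attn = relation_biased_attention(uniform_scores, graph.adjacency())
    for name, row in zip(graph.nodes, attn):
        print(f"{name:>12}: " + " ".join(f"{w:.3f}" for w in row))
```

With uniform logits, the `goldfinch` row spreads its weight evenly over all four nodes because it is linked to each of them, while each attribute row concentrates its weight on itself and on `goldfinch`. An additive bias only down-weights unrelated pairs rather than zeroing them; a hard attention mask would be the stricter alternative.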

Taxonomy

Topics: Semantic Web and Ontologies · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training