Hierarchical Cross-modal Prompt Learning for Vision-Language Models

Hao Zheng; Shunzhi Yang; Zhuoxin He; Jinfeng Yang; Zhenhua Huang

arXiv:2507.14976 · cs.CV · August 15, 2025

Open Access

TL;DR

This paper introduces HiCroPL, a hierarchical cross-modal prompt learning framework for vision-language models that enhances generalization by enabling bidirectional knowledge flow between text and vision modalities, leading to state-of-the-art results.

Contribution

The paper proposes a hierarchical knowledge mapper and a cross-modal prompt mechanism that together improve semantic alignment and generalization in vision-language models.

Findings

1. Achieves state-of-the-art results on 11 benchmarks.

2. Enhances semantic alignment between text and vision modalities.

3. Improves generalization across multiple downstream tasks.

Abstract

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the…
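
To make the text-to-vision direction described in the abstract more concrete, here is a minimal sketch of one way text prompts could inject information into visual prompts at a single layer. The abstract does not specify the internals of the hierarchical knowledge mapper, so everything below is an assumption made for illustration: the cross-attention form, the residual update, the function name `text_to_visual`, the projection matrices `w_q`, `w_k`, `w_v`, and all dimensions are hypothetical and are not taken from the paper.

```python
import math
import random


def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def text_to_visual(text_prompts, visual_prompts, w_q, w_k, w_v):
    """Hypothetical mapper: visual prompts attend over text prompts.

    text_prompts:   n_t rows of width d_t
    visual_prompts: n_v rows of width d_v
    Returns the updated visual prompts and the attention weights.
    """
    q = matmul(visual_prompts, w_q)           # (n_v, d) queries from vision
    k = matmul(text_prompts, w_k)             # (n_t, d) keys from text
    v = matmul(text_prompts, w_v)             # (n_t, d_v) values from text
    k_t = [list(col) for col in zip(*k)]      # transpose keys to (d, n_t)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                   # (n_v, n_t) similarity scores
    attn = [softmax([s / scale for s in row]) for row in scores]
    injected = matmul(attn, v)                # text semantics per visual prompt
    # Residual update (an assumption): keep the original prompt, add text info.
    updated = [[p + u for p, u in zip(p_row, u_row)]
               for p_row, u_row in zip(visual_prompts, injected)]
    return updated, attn


def rand_matrix(rows, cols, rng, scale=1.0):
    """Random Gaussian matrix; stands in for learned parameters."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_t, n_v, d_t, d_v, d = 3, 2, 8, 12, 4        # toy sizes, chosen arbitrarily
text_prompts = rand_matrix(n_t, d_t, rng)
visual_prompts = rand_matrix(n_v, d_v, rng)
w_q = rand_matrix(d_v, d, rng, 0.1)
w_k = rand_matrix(d_t, d, rng, 0.1)
w_v = rand_matrix(d_t, d_v, rng, 0.1)

updated, attn = text_to_visual(text_prompts, visual_prompts, w_q, w_k, w_v)
```

In this sketch each visual prompt keeps its original content and receives a weighted mixture of projected text prompts. A real implementation would learn the projections end to end and would apply such a mapping across several encoder layers; how HiCroPL actually does this is not stated in the portion of the abstract shown here.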

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications