PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

Hao Tan; Zichang Tan; Jun Li; Jun Wan; Zhen Lei

arXiv:2401.17881·cs.CV·February 1, 2024·1 cites

Open Access

TL;DR

This paper introduces PVLR, a novel framework for multi-label image recognition that leverages dual-prompting and bidirectional fusion of visual and linguistic features to improve label prediction accuracy.

Contribution

The paper proposes a dual-prompting strategy and a bidirectional fusion module to better utilize language knowledge and enhance multi-label recognition performance.

Findings

01. PVLR outperforms existing methods on the MS-COCO, Pascal VOC 2007, and NUS-WIDE datasets.

02. The dual-prompting strategy effectively captures label semantics and relationships.

03. Bidirectional fusion improves the interaction between visual and linguistic features.

Abstract

Multi-label image recognition is a fundamental task in computer vision. Recently, vision-language models have made notable advancements in this area. However, previous methods often failed to effectively leverage the rich knowledge within language models and instead incorporated label semantics into visual features in a unidirectional manner. In this paper, we propose a Prompt-driven Visual-Linguistic Representation Learning (PVLR) framework to better leverage the capabilities of the linguistic modality. In PVLR, we first introduce a dual-prompting strategy comprising Knowledge-Aware Prompting (KAP) and Context-Aware Prompting (CAP). KAP utilizes fixed prompts to capture the intrinsic semantic knowledge and relationships across all labels, while CAP employs learnable prompts to capture context-aware label semantics and relationships. Later, we propose an Interaction and Fusion Module…
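The contrast between fixed and learnable prompts, and the multi-label nature of the prediction, can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the layout of the learnable prompt (context vectors followed by a label embedding), the cosine-similarity scoring, the `scale` factor, the threshold, and all function names are assumptions made for illustration, and the toy 2-D vectors stand in for encoder outputs.

```python
import math

def fixed_prompts(labels, template="a photo of a {}."):
    """KAP-like idea: one hand-written, non-trainable text prompt per label.
    The template string is an assumption, not taken from the paper."""
    return [template.format(name) for name in labels]

def learnable_prompt(context_vectors, label_embedding):
    """CAP-like idea: trainable context vectors placed before the label embedding.
    This layout is an assumption; in training, context_vectors would be updated
    by gradient descent while the label embedding stays tied to the label name."""
    return context_vectors + [label_embedding]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def predict_labels(image_feat, label_feats, scale=10.0, threshold=0.5):
    """Score each label independently with a sigmoid, so several labels
    (or none) can be predicted for the same image."""
    probs = {
        name: sigmoid(scale * cosine(image_feat, feat))
        for name, feat in label_feats.items()
    }
    predicted = {name for name, p in probs.items() if p > threshold}
    return predicted, probs

# Toy features standing in for image and text encoder outputs (hypothetical values).
image_feat = [1.0, 0.0]
label_feats = {"dog": [1.0, 0.0], "cat": [-1.0, 0.0], "person": [0.6, 0.8]}
predicted, probs = predict_labels(image_feat, label_feats)
```

Because each label gets its own sigmoid rather than sharing a softmax, `predicted` can contain more than one label for a single image, which is the defining property of the multi-label setting.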

Taxonomy

Topics: Text and Document Classification Technologies · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications