Learning Semantic-Aware Representation in Visual-Language Models for Multi-Label Recognition with Partial Labels

Haoxian Ruan; Zhihua Xu; Zhijing Yang; Yongyi Lu; Jinghui Qin; Tianshui Chen

arXiv:2412.10843·cs.CV·December 17, 2024

TL;DR

This paper introduces a semantic decoupling and prompt optimization framework for CLIP-based multi-label recognition with partial labels, effectively reducing semantic confusion and improving accuracy on benchmark datasets.

Contribution

It proposes a novel semantic decoupling module and category-specific prompt optimization to enhance CLIP's performance in multi-label recognition with partial labels.
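The semantic decoupling idea can be illustrated with a minimal sketch. It assumes the visual encoder yields a grid of local patch features and that each category has its own text embedding (for example, one produced from a category-specific prompt). The function names, the attention scheme, and the values of `tau` and `scale` are hypothetical choices for illustration, not the paper's module. Each category's text embedding attends over the spatial positions, which gives one attention map and one pooled visual feature per category. Each category is then scored against its own text embedding instead of a single global image vector.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis):
    """Numerically stable softmax along `axis`."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def semantic_decoupling(patch_feats, class_text_embs, tau=0.07):
    """Hypothetical text-guided decoupling of a visual feature map.

    patch_feats:     (HW, D) local visual features from the image encoder
    class_text_embs: (C, D)  one text embedding per category
    Returns (C, D) category-specific features and (HW, C) attention maps.
    """
    f = l2_normalize(patch_feats)
    t = l2_normalize(class_text_embs)
    # Cosine similarity of every spatial position to every category: (HW, C)
    sim = f @ t.T / tau
    # Normalize over spatial positions so each category gets its own map
    attn = softmax(sim, axis=0)
    # Attention-weighted pooling: one visual feature per category, (C, D)
    cat_feats = attn.T @ patch_feats
    return cat_feats, attn

def category_logits(cat_feats, class_text_embs, scale=100.0):
    """Score category c by the cosine similarity between its own pooled
    visual feature and its own text embedding (diagonal pairing only)."""
    v = l2_normalize(cat_feats)
    t = l2_normalize(class_text_embs)
    return scale * np.sum(v * t, axis=-1)  # (C,)

# Usage with random stand-in features (no real CLIP model involved)
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 512))  # e.g. a 7x7 grid of 512-d features
texts = rng.normal(size=(20, 512))    # e.g. 20 categories, as in Pascal VOC
feats, attn = semantic_decoupling(patches, texts)
logits = category_logits(feats, texts)
print(feats.shape, attn.shape, logits.shape)  # (20, 512) (49, 20) (20,)
```

Giving each category its own pooled feature avoids scoring every category against one shared global image vector, which is the source of semantic confusion the paper points to.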

Findings

1. Significantly outperforms state-of-the-art methods on the COCO and Pascal VOC datasets.
2. Effectively separates category information, reducing semantic confusion.
3. Achieves better performance with a simpler model structure.

Abstract

Multi-label recognition with partial labels (MLR-PL), in which only some labels are known while the others are unknown for each image, is a practical task in computer vision, because collecting large-scale, completely annotated multi-label datasets is difficult in real application scenarios. Recently, vision-language models (e.g., CLIP) have demonstrated impressive transferability to downstream tasks in data-limited or label-limited settings. However, current CLIP-based methods suffer from semantic confusion in the MLR task, because a single global visual and textual representation shared by all categories lacks fine-grained information. In this work, we address this problem by introducing a semantic decoupling module and a category-specific prompt optimization method into a CLIP-based framework. Specifically, the semantic decoupling module following the visual encoder learns category-specific feature maps…
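Under the MLR-PL definition above, each image has a mix of known positive, known negative, and unknown labels. A common way to train in this setting, offered here as an assumption rather than the loss stated in the paper, is binary cross-entropy computed only over the known entries. The encoding 1 / 0 / -1 for positive / negative / unknown and the function name `partial_label_bce` are hypothetical choices for this sketch.

```python
import numpy as np

def partial_label_bce(logits, labels):
    """Binary cross-entropy averaged over *known* labels only.

    logits: (N, C) raw per-category scores
    labels: (N, C) with 1 = known positive, 0 = known negative,
            -1 = unknown (hypothetical encoding for this sketch)
    """
    known = labels >= 0
    targets = (labels == 1).astype(float)
    # Stable BCE with logits: -log(sigmoid(x)) = logaddexp(0, -x),
    # -log(1 - sigmoid(x)) = logaddexp(0, x)
    loss = targets * np.logaddexp(0.0, -logits) + (1.0 - targets) * np.logaddexp(0.0, logits)
    # Unknown entries contribute nothing; average over known entries only
    return float((loss * known).sum() / max(known.sum(), 1))

# Usage: two images, three categories, several unknown labels
logits = np.array([[2.0, -1.0, 0.5], [0.0, 3.0, -2.0]])
labels = np.array([[1, 0, -1], [-1, 1, -1]])
print(partial_label_bce(logits, labels))
```

Because unknown entries are masked out, the scores the model assigns to unannotated categories do not affect the gradient. This is the simplest choice; it neither treats unknown labels as negatives nor tries to infer them.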

Taxonomy

Topics: Text and Document Classification Technologies

Methods: Softmax · Attention Is All You Need