Unleashing the Power of Vision-Language Models for Long-Tailed Multi-Label Visual Recognition

Wei Tang, Zuo-Zheng Wang, Kun Zhang, Tong Wei, and Min-Ling Zhang

arXiv:2511.20641·cs.CV·November 26, 2025

Open Access

TL;DR

This paper introduces CAPNET, a framework that combines CLIP's text encoder with explicit label correlation modeling for long-tailed multi-label visual recognition, and reports better results than existing methods.

Contribution

The paper proposes CAPNET, which explicitly models label correlations and employs a distribution-balanced loss, test-time ensembling, and parameter-efficient fine-tuning to enhance multi-label recognition on imbalanced datasets.
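To make the idea of re-balancing a multi-label loss concrete, here is a minimal sketch. It is not the distribution-balanced loss the paper uses; it swaps in a simpler inverse-frequency weighted binary cross-entropy, purely for illustration. The function names (`class_weights`, `weighted_bce`) and the label counts are hypothetical.

```python
import math

def class_weights(label_counts):
    """Inverse-frequency weights, normalized so their mean is 1 (illustrative choice)."""
    inv = [1.0 / c for c in label_counts]
    mean_inv = sum(inv) / len(inv)
    return [w / mean_inv for w in inv]

def weighted_bce(probs, targets, weights, eps=1e-12):
    """Per-class binary cross-entropy, scaled by a class weight and averaged over classes."""
    total = 0.0
    for p, y, w in zip(probs, targets, weights):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)

# Hypothetical training-set counts for a head, a medium, and a tail class.
label_counts = [1000, 100, 10]
weights = class_weights(label_counts)

# All three labels are present; compare missing the head class vs. missing the tail class.
targets = [1, 1, 1]
loss_head_miss = weighted_bce([0.1, 0.9, 0.9], targets, weights)
loss_tail_miss = weighted_bce([0.9, 0.9, 0.1], targets, weights)

print("weights:", [round(w, 3) for w in weights])
print("loss when head class is missed:", round(loss_head_miss, 3))
print("loss when tail class is missed:", round(loss_tail_miss, 3))
```

Under this weighting, an equally confident mistake on the tail class costs more than the same mistake on the head class, which is the general effect that re-balanced losses aim for.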

Findings

01. CAPNET outperforms state-of-the-art methods on VOC-LT, COCO-LT, and NUS-WIDE.

02. Label correlation modeling improves tail class recognition.

03. Test-time ensembling and fine-tuning enhance generalization and robustness.

Abstract

Long-tailed multi-label visual recognition poses a significant challenge, as images typically contain multiple labels with highly imbalanced class distributions, leading to biased models that favor head classes while underperforming on tail classes. Recent efforts have leveraged pre-trained vision-language models, such as CLIP, alongside long-tailed learning techniques to exploit rich visual-textual priors for improved performance. However, existing methods often derive semantic inter-class relationships directly from imbalanced datasets, resulting in unreliable correlations for tail classes due to data scarcity. Moreover, CLIP's zero-shot paradigm is optimized for single-label image-text matching, making it suboptimal for multi-label tasks. To address these issues, we propose the correlation adaptation prompt network (CAPNET), a novel end-to-end framework that explicitly models label…
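The remark that single-label image-text matching is a poor fit for multi-label tasks can be illustrated with a toy comparison. The sketch below is not CAPNET; it only contrasts softmax scoring, where classes compete for one unit of probability, with independent per-class sigmoid scoring, which is common in multi-label recognition. The class names and similarity logits are made-up assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(x):
    """Logistic function applied to one score."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical image-text similarity logits for an image containing
# both a person and a dog (values are invented for illustration).
classes = ["person", "dog", "bicycle", "sofa"]
logits = [4.0, 3.8, -2.0, -3.0]

softmax_probs = softmax(logits)               # classes compete with each other
sigmoid_probs = [sigmoid(z) for z in logits]  # each class is scored independently

for name, p_soft, p_sig in zip(classes, softmax_probs, sigmoid_probs):
    print(f"{name:8s} softmax={p_soft:.3f} sigmoid={p_sig:.3f}")
```

Because softmax probabilities sum to one, the two labels that are both present have to split the probability mass, while the sigmoid scores for both can be close to one at the same time.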

Taxonomy

Topics: Text and Document Classification Technologies · Domain Adaptation and Few-Shot Learning · Topic Modeling