Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition
Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Feihu Yan, Yuan He, Hui Xue

TL;DR
This paper introduces a novel Vision Transformer-based approach for multi-label image recognition that leverages instance-aware attention mechanisms and weakly supervised localization to improve performance.
Contribution
The paper proposes diverse instance discovery (DiD), a framework built on a semantic category-aware module and a spatial relationship-aware module, which improves multi-label recognition without requiring strongly supervised data.
Findings
Achieves state-of-the-art results on benchmark datasets.
Effectively mines diverse instances using attention modules.
Outperforms previous CNN-based methods in multi-label recognition.
Abstract
Previous works on multi-label image recognition (MLIR) usually use CNNs as a starting point for research. In this paper, we take pure Vision Transformer (ViT) as the research base and make full use of the advantages of Transformer with long-range dependency modeling to circumvent the disadvantages of CNNs limited to local receptive field. However, for multi-label images containing multiple objects from different categories, scales, and spatial relations, it is not optimal to use global information alone. Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images, named diverse instance discovery (DiD). To this end, we propose a semantic category-aware module and a spatial relationship-aware module, respectively, and then combine the two by a re-constraint strategy to obtain instance-aware attention maps. Finally, we propose a…
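The abstract describes combining a per-category (semantic) attention map with patch-to-patch (spatial) relations to obtain instance-aware attention maps, but it does not give the formulas. The sketch below is only an illustration of that general idea under stated assumptions and is not the authors' method: `class_queries`, the softmax scoring, the cosine-similarity affinity, and the "propagate, then renormalize" combination step are all hypothetical choices made here for clarity.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def instance_aware_maps(patch_tokens, class_queries):
    """Illustrative combination of semantic and spatial cues.

    patch_tokens:  (N, D) array of ViT patch embeddings.
    class_queries: (K, D) array, one hypothetical query vector per category.
    Returns a (K, N) array; each row is an attention distribution over patches.
    """
    d = patch_tokens.shape[1]

    # Semantic cue (assumed form): how strongly each patch responds to each category.
    semantic = softmax(class_queries @ patch_tokens.T / np.sqrt(d), axis=1)  # (K, N)

    # Spatial cue (assumed form): patch-to-patch affinity from cosine similarity.
    norms = np.linalg.norm(patch_tokens, axis=1, keepdims=True) + 1e-8
    normed = patch_tokens / norms
    affinity = softmax(normed @ normed.T, axis=1)  # (N, N), rows sum to 1

    # Combination (assumed form): spread each category map along the affinities,
    # then renormalize so every row remains a distribution over patches.
    refined = semantic @ affinity  # (K, N)
    return refined / refined.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # e.g. a 14 x 14 grid of patch tokens
queries = rng.normal(size=(20, 64))   # e.g. 20 categories
maps = instance_aware_maps(tokens, queries)
print(maps.shape)  # (20, 196)
```

Each row of `maps` can be reshaped to the patch grid (14 x 14 in this toy setup) to view it as a per-category heat map. In a trained model the tokens and queries would come from the network rather than from random draws.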
Taxonomy
Topics: Text and Document Classification Technologies · Image Retrieval and Classification Techniques · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Balanced Selection · Position-Wise Feed-Forward Layer · Dense Connections · Softmax · Vision Transformer · Label Smoothing
