Visual Transformers with Primal Object Queries for Multi-Label Image Classification
Vacit Oguz Yazici, Joost van de Weijer, Longlong Yu

TL;DR
This paper introduces primal object queries in vision transformers for multi-label image classification, enhancing performance and convergence speed over previous methods.
Contribution
It proposes feeding primal object queries to the transformer decoder only at the start of the decoder stack, rather than inputting the same object queries to every decoder layer, which improves training efficiency and accuracy.
Findings
Improves class-wise F1 score by 2.1% on MS-COCO
Speeds up convergence by 79% on MS-COCO
Achieves state-of-the-art results on NUS-WIDE
Abstract
Multi-label image classification is about predicting a set of class labels that can be considered as orderless sequential data. Transformers process sequential data as a whole; therefore, they are inherently good at set prediction. The first vision-based transformer model, which was proposed for the object detection task, introduced the concept of object queries. Object queries are learnable positional encodings that are used by attention modules in decoder layers to decode the object classes or bounding boxes using the regions of interest in an image. However, inputting the same set of object queries to different decoder layers hinders the training: it results in lower performance and delays convergence. In this paper, we propose the usage of primal object queries that are only provided at the start of the transformer decoder stack. In addition, we improve the mixup technique…
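The difference between the two ways of supplying object queries can be sketched in a few lines of NumPy. This is only an illustration of one reading of the abstract, not the authors' implementation: the function names (`cross_attention`, `decode`), the projection-free single-head attention, the zero-initialised decoder input in the baseline branch, and all tensor sizes are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, memory):
    # Single-head scaled dot-product attention with no learned projections
    # (a simplification; real decoder layers use learned weights).
    d = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(d))
    return weights @ memory

def decode(memory, object_queries, num_layers, primal):
    """Toy decoder stack (hypothetical helper).

    primal=False: the same object queries are added to the input of
                  every layer (assumed baseline behaviour).
    primal=True:  the object queries are the input to the first layer
                  only; later layers see just the previous layer's output.
    """
    if primal:
        hidden = object_queries.copy()
    else:
        hidden = np.zeros_like(object_queries)
    for _ in range(num_layers):
        q = hidden if primal else hidden + object_queries
        hidden = hidden + cross_attention(q, memory)  # residual connection
    return hidden

rng = np.random.default_rng(0)
memory = rng.normal(size=(49, 16))   # e.g. a 7x7 grid of image features
queries = rng.normal(size=(5, 16))   # one query per class (assumed)
out_primal = decode(memory, queries, num_layers=3, primal=True)
out_repeat = decode(memory, queries, num_layers=3, primal=False)
print(out_primal.shape, out_repeat.shape)
```

In a real model each output row would then pass through a classification head to produce one label score; that step is omitted here because the abstract does not describe it.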
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Image Retrieval and Classification Techniques
Methods: Mixup
