General Multi-label Image Classification with Transformers
Jack Lanchantin, Tianlu Wang, Vicente Ordonez, Yanjun Qi

TL;DR
This paper introduces the Classification Transformer (C-Tran), a Transformer-based framework for multi-label image classification that captures dependencies among labels and handles uncertain labels, achieving state-of-the-art results on COCO and Visual Genome.
Contribution
The paper presents C-Tran, a Transformer-based model with a label mask training objective that explicitly models label uncertainty and improves multi-label classification performance.
Findings
Achieves state-of-the-art results on COCO and Visual Genome datasets.
Effectively handles partial and extra label annotations during inference.
Demonstrates improved performance across multiple diverse image datasets.
Abstract
Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the…
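The ternary label-state idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the integer codes (`UNKNOWN`, `NEGATIVE`, `POSITIVE`), the helper names `mask_labels` and `states_from_partial`, and the choice of how many labels to hide are all assumptions made for illustration. The first helper hides a random subset of ground-truth labels as unknown, as one might do during training; the second builds the state vector when only some labels are annotated at inference time.

```python
import random

# Hypothetical integer codes for the three label states (an assumption,
# not taken from the paper).
UNKNOWN, NEGATIVE, POSITIVE = 0, 1, 2


def mask_labels(targets, num_masked, rng):
    """Hide `num_masked` randomly chosen labels as UNKNOWN.

    targets: list of 0/1 ground-truth labels for one image.
    Returns (states, masked_idx): the ternary state per label and the
    sorted indices that were hidden, which a model would be asked to predict.
    """
    masked_idx = set(rng.sample(range(len(targets)), num_masked))
    states = [
        UNKNOWN if i in masked_idx else (POSITIVE if t == 1 else NEGATIVE)
        for i, t in enumerate(targets)
    ]
    return states, sorted(masked_idx)


def states_from_partial(known, num_labels):
    """Build label states when only some labels are annotated.

    known: dict mapping label index -> 0/1 for the annotated labels.
    Every label missing from `known` is marked UNKNOWN.
    """
    states = [UNKNOWN] * num_labels
    for i, t in known.items():
        states[i] = POSITIVE if t == 1 else NEGATIVE
    return states


rng = random.Random(0)
targets = [1, 0, 0, 1, 0]
states, masked_idx = mask_labels(targets, num_masked=2, rng=rng)
partial_states = states_from_partial({0: 1, 4: 0}, num_labels=5)
```

In a setup like this, the state vector would be fed to the model alongside the visual features, and a loss would typically be computed on the hidden positions; how the states are embedded and combined with image features is left out of the sketch.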
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam · Label Smoothing
