General Multi-label Image Classification with Transformers

Jack Lanchantin; Tianlu Wang; Vicente Ordonez; Yanjun Qi

arXiv:2011.14027 · cs.CV · December 1, 2020 · 20 cites

TL;DR

This paper introduces the Classification Transformer (C-Tran), a Transformer-based framework for multi-label image classification that models dependencies among visual features and labels, handles labels whose state is unknown, and reports state-of-the-art results on COCO and Visual Genome.

Contribution

The paper presents C-Tran, a Transformer encoder trained with a label mask objective that encodes each label as positive, negative, or unknown, so that label uncertainty is represented explicitly during training.

Findings

01. Achieves state-of-the-art results on COCO and Visual Genome datasets.

02. Effectively handles partial and extra label annotations during inference.

03. Demonstrates improved performance across multiple diverse image datasets.

Abstract

Multi-label image classification is the task of predicting a set of labels corresponding to objects, attributes or other entities present in an image. In this work we propose the Classification Transformer (C-Tran), a general framework for multi-label image classification that leverages Transformers to exploit the complex dependencies among visual features and labels. Our approach consists of a Transformer encoder trained to predict a set of target labels given an input set of masked labels, and visual features from a convolutional neural network. A key ingredient of our method is a label mask training objective that uses a ternary encoding scheme to represent the state of the labels as positive, negative, or unknown during training. Our model shows state-of-the-art performance on challenging datasets such as COCO and Visual Genome. Moreover, because our model explicitly represents the…
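The ternary label encoding described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the integer codes, the function name `mask_labels`, and the choice of how many labels to hide are all assumptions made here for illustration. The idea shown is that some labels are hidden as unknown, the rest are exposed to the model as known positive or negative, and the hidden labels become the prediction targets.

```python
import random

# Hypothetical integer codes for the three label states (an assumption,
# not taken from the paper's code).
UNKNOWN, NEGATIVE, POSITIVE = -1, 0, 1


def mask_labels(labels, num_masked, rng):
    """Build a ternary state vector for one training example.

    labels:     list of 0/1 ground-truth labels.
    num_masked: how many labels to hide as UNKNOWN.
    rng:        a random.Random instance, for reproducibility.

    Returns (states, masked_idx): the state of every label and the sorted
    indices of the labels that were hidden.
    """
    if not 0 <= num_masked <= len(labels):
        raise ValueError("num_masked must be between 0 and len(labels)")
    masked_idx = sorted(rng.sample(range(len(labels)), num_masked))
    hidden = set(masked_idx)
    states = [
        UNKNOWN if i in hidden else (POSITIVE if y == 1 else NEGATIVE)
        for i, y in enumerate(labels)
    ]
    return states, masked_idx


labels = [1, 0, 0, 1, 0]
states, masked_idx = mask_labels(labels, num_masked=3, rng=random.Random(0))
# The hidden labels are what a model would be trained to predict.
targets = [labels[i] for i in masked_idx]
```

Under the same assumptions, inference with partial annotations would use the same state vector: labels that are already known are set to positive or negative, and every remaining label is left as unknown.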

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Residual Connection · Attention Is All You Need · Multi-Head Attention · Softmax · Adam · Label Smoothing