ViDT: An Efficient and Effective Fully Transformer-based Object Detector

Hwanjun Song; Deqing Sun; Sanghyuk Chun; Varun Jampani; Dongyoon Han; Byeongho Heo; Wonjae Kim; Ming-Hsuan Yang

arXiv:2110.03921 · cs.CV · November 30, 2021 · 46 cites

TL;DR

ViDT is a fully transformer-based object detector that integrates vision and detection transformers. It extends Swin Transformer into a standalone detector with a reconfigured attention module and adds a computationally efficient multi-scale transformer decoder, achieving a strong accuracy and latency trade-off on the COCO benchmark.

Contribution

This paper introduces ViDT, a new fully transformer-based object detection architecture that enhances detection performance while maintaining computational efficiency.

Findings

1. Achieves 49.2 AP on the COCO dataset.
2. Provides the best AP and latency trade-off among similar models.
3. Demonstrates high scalability for large models.
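
One way to read the "best AP and latency trade-off" finding is as a Pareto comparison: a detector is on the trade-off frontier if no other detector is at least as accurate and at least as fast, and strictly better on one of the two. The following is a minimal sketch of that check, not the paper's evaluation code. The `Detector` type, the `pareto_front` helper, the model names, and all numbers are hypothetical placeholders and are not results from the paper.

```python
from typing import List, NamedTuple


class Detector(NamedTuple):
    name: str
    ap: float          # average precision, higher is better
    latency_ms: float  # inference latency, lower is better


def pareto_front(models: List[Detector]) -> List[Detector]:
    """Return the models that no other model dominates on (AP, latency)."""
    front = []
    for m in models:
        dominated = any(
            o.ap >= m.ap
            and o.latency_ms <= m.latency_ms
            and (o.ap > m.ap or o.latency_ms < m.latency_ms)
            for o in models
        )
        if not dominated:
            front.append(m)
    # Sort by latency so the frontier reads from fastest to slowest.
    return sorted(front, key=lambda d: d.latency_ms)


# Hypothetical placeholder values for illustration only (NOT from the paper).
candidates = [
    Detector("model-a", ap=40.0, latency_ms=50.0),
    Detector("model-b", ap=42.0, latency_ms=40.0),
    Detector("model-c", ap=45.0, latency_ms=80.0),
    Detector("model-d", ap=44.0, latency_ms=90.0),
]

front_names = [d.name for d in pareto_front(candidates)]
print(front_names)
```

With these placeholder values, `model-a` is dominated by `model-b` and `model-d` is dominated by `model-c`, so only `model-b` and `model-c` remain on the frontier.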

Abstract

Transformers are transforming the landscape of computer vision, especially for recognition tasks. Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification. In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector. ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector, followed by a computationally efficient transformer decoder that exploits multi-scale features and auxiliary techniques essential to boost the detection performance without much increase in computational load. Extensive evaluation results on the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP and latency trade-off among existing…

Code & Models

Repositories

naver-ai/vidt (PyTorch, official)

Videos

ViDT: An Efficient and Effective Fully Transformer-based Object Detector · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Stochastic Depth · Residual Connection · Dropout · Layer Normalization · Position-Wise Feed-Forward Layer · Label Smoothing · Byte Pair Encoding