Exploring Plain Vision Transformer Backbones for Object Detection

Yanghao Li; Hanzi Mao; Ross Girshick; Kaiming He

arXiv:2203.16527·cs.CV·June 13, 2022·42 cites

TL;DR

This paper investigates plain, non-hierarchical Vision Transformer (ViT) backbones for object detection and shows that they reach competitive performance with minimal architectural modifications and a simple feature pyramid.

Contribution

The paper introduces ViTDet, a detector built on a plain ViT backbone pre-trained as a Masked Autoencoder (MAE). It achieves competitive results without a hierarchical backbone and without shifted-window attention.

Findings

01. Plain ViT backbones can be effectively used for object detection.

02. A simple feature pyramid built from a single-scale feature map suffices, without the common FPN design.

03. Window attention without shifting, aided by very few cross-window propagation blocks, is adequate for strong performance.
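As an illustration of finding 02, here is a minimal sketch of the shape bookkeeping behind a pyramid derived from one feature map. The patch size of 16, the 1024x1024 input, and the scale factors (4, 2, 1, 1/2) are assumptions for this example and are not stated on this page. The function name `simple_pyramid_shapes` is hypothetical. The sketch only computes strides and spatial sizes and does not implement the learned resampling layers.

```python
from fractions import Fraction


def simple_pyramid_shapes(image_hw, patch_size=16,
                          scale_factors=(4, 2, 1, Fraction(1, 2))):
    """Return [(stride, (height, width)), ...] for each pyramid level.

    All levels are derived from the single stride-`patch_size` map that a
    plain ViT produces. The defaults are illustrative assumptions.
    """
    height, width = image_hw
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")

    # The plain backbone keeps one resolution throughout: 1/patch_size.
    base_h, base_w = height // patch_size, width // patch_size

    levels = []
    for factor in scale_factors:
        factor = Fraction(factor)
        stride = Fraction(patch_size) / factor
        level_h, level_w = base_h * factor, base_w * factor
        if any(v.denominator != 1 for v in (stride, level_h, level_w)):
            raise ValueError(f"scale factor {factor} gives a fractional size")
        levels.append((int(stride), (int(level_h), int(level_w))))
    return levels


if __name__ == "__main__":
    for stride, size in simple_pyramid_shapes((1024, 1024)):
        print(f"stride {stride:>2}: {size[0]} x {size[1]}")
```

Under these assumptions, every level is computed directly from the same 64 x 64 backbone output. This differs from a standard FPN, which merges maps taken from several backbone stages.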

Abstract

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K…
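Observation (ii) can be sketched in two small pieces: splitting the token grid into fixed, non-overlapping windows, and choosing which few blocks let information cross windows. This is a hedged illustration only. The helper names `window_partition` and `global_block_indices`, the default of four propagation blocks, and their even spacing through the depth are assumptions for this example. Attention itself is not implemented.

```python
def window_partition(grid, window):
    """Split a 2D grid (list of rows) into non-overlapping window x window tiles.

    The tiles are fixed: there is no shift between layers, so tokens only
    see tokens in their own tile during windowed attention.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be a multiple of the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window]
                          for row in grid[top:top + window]])
    return tiles


def global_block_indices(depth, num_global=4):
    """Pick the last block of each of `num_global` equal subsets (0-based).

    These blocks are assumed to propagate information across windows;
    all other blocks use windowed attention.
    """
    if depth % num_global:
        raise ValueError("depth must be divisible by num_global")
    step = depth // num_global
    return [step * (i + 1) - 1 for i in range(num_global)]


if __name__ == "__main__":
    # A 4x4 grid of token ids, split into four 2x2 windows.
    token_ids = [[r * 4 + c for c in range(4)] for r in range(4)]
    for tile in window_partition(token_ids, 2):
        print(tile)
    print(global_block_indices(12))
```

With a 12-block backbone and the assumed defaults, only four blocks mix information between windows. The remaining eight attend within fixed tiles.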

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Softmax · Absolute Position Encodings · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer