Exploring Plain Vision Transformer Backbones for Object Detection
Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He

TL;DR
This paper investigates plain, non-hierarchical Vision Transformer backbones for object detection and shows that they reach competitive performance with minimal architectural modifications and a simple feature pyramid.
Contribution
The paper introduces ViTDet, a plain-backbone detector whose ViT backbone is pre-trained as a Masked Autoencoder (MAE). It achieves competitive results without a hierarchical backbone or shifted-window attention.
Findings
Plain ViT backbones can be effectively used for object detection.
A simple feature pyramid from a single-scale feature map suffices.
Window attention without shifting, aided by very few cross-window propagation blocks, is sufficient for strong performance.
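The second finding can be illustrated with a minimal, dependency-free sketch. It builds four resolutions from one backbone feature map, without the top-down and lateral connections of FPN. Several things here are assumptions for illustration and not stated in this summary: the patch size of 16, the output strides of 4, 8, 16, and 32, and all function names. Max pooling and nearest-neighbour repetition stand in for the learned strided convolutions and deconvolutions a real detector would use.

```python
from typing import Dict, List

Grid = List[List[float]]  # one channel of a feature map, as rows of values


def downsample2x(x: Grid) -> Grid:
    """Halve resolution with 2x2 max pooling (stand-in for a stride-2 convolution)."""
    return [
        [max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
         for j in range(0, len(x[0]) - 1, 2)]
        for i in range(0, len(x) - 1, 2)
    ]


def upsample2x(x: Grid) -> Grid:
    """Double resolution by repeating values (stand-in for a stride-2 deconvolution)."""
    out: Grid = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def simple_feature_pyramid(last_map: Grid, backbone_stride: int = 16) -> Dict[int, Grid]:
    """Derive every pyramid level from the single last backbone map.

    Keys are output strides relative to the input image. The stride of 16
    is an assumed patch size, not a value taken from the summary above.
    """
    up_once = upsample2x(last_map)
    return {
        backbone_stride // 4: upsample2x(up_once),
        backbone_stride // 2: up_once,
        backbone_stride: last_map,
        backbone_stride * 2: downsample2x(last_map),
    }


# A 64x64 image with 16x16 patches yields a 4x4 token grid.
feature = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = simple_feature_pyramid(feature)
shapes = {stride: (len(m), len(m[0])) for stride, m in pyramid.items()}
print(shapes)
```

Every level is computed directly from `last_map`, so no intermediate backbone stage is needed. That is the property that lets a non-hierarchical ViT feed a multi-scale detection head.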
Abstract
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K…
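Point (ii) of the abstract can be sketched at the level of token grouping. The sketch below shows which tokens attend to each other in a block: fixed, non-overlapping windows in most blocks, and the whole grid in a few propagation blocks. The function names, the 12-block depth, and the placement of the propagation blocks are hypothetical choices for illustration. Global attention is only one possible propagation mechanism, and the attention computation itself is omitted.

```python
from typing import List, Set, Tuple

Token = Tuple[int, int]  # (row, column) position in the token grid


def window_partition(height: int, width: int, window: int) -> List[List[Token]]:
    """Group token positions into fixed, non-overlapping windows (no shifting)."""
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    windows: List[List[Token]] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (r, c)
                for r in range(top, top + window)
                for c in range(left, left + window)
            ])
    return windows


def attention_groups(height: int, width: int, window: int,
                     propagate: bool) -> List[List[Token]]:
    """Return the token groups within which one block computes attention."""
    if propagate:
        # Cross-window propagation block: every token sees every other token.
        return [[(r, c) for r in range(height) for c in range(width)]]
    return window_partition(height, width, window)


# Hypothetical schedule: 12 blocks, 4 of which propagate across windows.
depth = 12
propagation_blocks: Set[int] = {2, 5, 8, 11}
schedule = [
    len(attention_groups(4, 4, 2, propagate=(i in propagation_blocks)))
    for i in range(depth)
]
print(schedule)  # number of attention groups per block
```

The window layout is identical in every windowed block, so information crosses window boundaries only in the propagation blocks. This is what distinguishes the scheme from shifted-window designs.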
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Feature Pyramid Network · Softmax · Absolute Position Encodings · Layer Normalization · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer
