Benchmarking Detection Transfer Learning with Vision Transformers
Yanghao Li, Saining Xie, Xinlei Chen, Piotr Dollar, Kaiming He, Ross Girshick

TL;DR
This paper benchmarks Vision Transformer (ViT) models for object detection transfer learning. It presents training techniques that overcome the obstacles to using ViT as a detection backbone, and it finds that masking-based self-supervised pre-training improves detection accuracy on COCO, especially for larger models.
Contribution
The paper introduces training techniques that enable standard ViT models to serve as the backbone of Mask R-CNN, and it compares five ViT initializations, highlighting the benefits of masking-based self-supervised pre-training.
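Two of the obstacles named in the abstract, architectural incompatibility and high memory consumption, can be illustrated with a little arithmetic. The sketch below is not the authors' implementation. It assumes a patch size of 16 and a hypothetical 1024x1024 detection input, ignores any class token, and uses illustrative helper names (`patch_grid`, `tokens_to_map`, `attention_entries`). It shows that a ViT emits a flat sequence of patch tokens that must be rearranged into a 2D grid before a detection head can consume it, and that a global self-attention matrix grows quadratically with the number of tokens.

```python
def patch_grid(height, width, patch=16):
    """Return (rows, cols) of the patch grid; patch=16 is an assumed value."""
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch, width // patch


def tokens_to_map(tokens, rows, cols):
    """Reshape a flat, row-major list of patch tokens into a rows x cols grid."""
    if len(tokens) != rows * cols:
        raise ValueError("token count does not match the grid")
    return [tokens[r * cols:(r + 1) * cols] for r in range(rows)]


def attention_entries(num_tokens):
    """Entries in one global self-attention matrix (per head, per layer)."""
    return num_tokens * num_tokens


# Hypothetical detection-sized input; the size is an assumption for illustration.
rows, cols = patch_grid(1024, 1024)
num_tokens = rows * cols

# Stand-in tokens: integers instead of embedding vectors.
feature_map = tokens_to_map(list(range(num_tokens)), rows, cols)

print(rows, cols, num_tokens, attention_entries(num_tokens))
```

Under these assumptions, a 224x224 classification input yields 196 tokens while the 1024x1024 input yields 4096, so a single global attention matrix is more than 400 times larger at detection resolution.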
Findings
Masking-based self-supervised pre-training improves detection accuracy.
Improvements increase with larger model sizes.
The methods enable effective use of ViT in detection benchmarks.
Abstract
Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random…
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Region Proposal Network · Adam · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Convolution · Residual Connection
