Benchmarking Detection Transfer Learning with Vision Transformers

Yanghao Li; Saining Xie; Xinlei Chen; Piotr Dollár; Kaiming He; Ross Girshick

arXiv:2111.11429 · cs.CV · November 23, 2021 · 75 cites

TL;DR

This paper benchmarks Vision Transformer (ViT) models for object detection transfer learning. It presents training techniques that overcome the obstacles to using standard ViT backbones, and finds that masking-based self-supervised pre-training improves detection accuracy on COCO, especially for larger models.

Contribution

The paper introduces training techniques that let standard ViT models serve as the backbone of Mask R-CNN, and compares five ViT initializations, highlighting the benefits of masking-based self-supervised learning.

Findings

1. Masking-based self-supervised pre-training improves detection accuracy on COCO.

2. The accuracy gains grow with model size.

3. The proposed training techniques make standard ViT models usable as Mask R-CNN backbones for detection benchmarking.

Abstract

Object detection is a central downstream task used to test if pre-trained network parameters confer benefits, such as improved accuracy or training speed. The complexity of object detection methods can make this benchmarking non-trivial when new architectures, such as Vision Transformer (ViT) models, arrive. These difficulties (e.g., architectural incompatibility, slow training, high memory consumption, unknown training formulae, etc.) have prevented recent studies from benchmarking detection transfer learning with standard ViT models. In this paper, we present training techniques that overcome these challenges, enabling the use of standard ViT models as the backbone of Mask R-CNN. These tools facilitate the primary goal of our study: we compare five ViT initializations, including recent state-of-the-art self-supervised learning methods, supervised initialization, and a strong random…

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · COVID-19 diagnosis using AI

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Region Proposal Network · Adam · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing · Convolution · Residual Connection