Toward Transformer-Based Object Detection

Josh Beal; Eric Kim; Eric Tzeng; Dong Huk Park; Andrew Zhai; Dmitry Kislyuk

arXiv:2012.09958 · cs.CV · December 21, 2020 · 140 cites

TL;DR

This paper explores using Vision Transformers as a backbone for object detection, demonstrating competitive results and advantages over traditional methods, marking progress toward pure-transformer vision models.

Contribution

It introduces ViT-FRCNN, a novel transformer-based detection model that achieves competitive COCO results and shows improvements in out-of-domain performance and object size handling.

Findings

01. ViT-FRCNN achieves competitive COCO detection results.

02. Transformer-based models show improved out-of-domain performance.

03. ViT-FRCNN shows less reliance on non-maximum suppression in detection.

Abstract

Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are…
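The resolution limit mentioned in the abstract follows from self-attention comparing every token with every other token, so its cost grows with the square of the number of image patches. The sketch below is only an illustration of that scaling, not code from the paper: the function name `attention_cost` is hypothetical, the 16-pixel patch size is an assumed value, and the count ignores any extra class token.

```python
def attention_cost(height, width, patch_size=16):
    """Return (num_tokens, attention_entries) for one attention head.

    patch_size=16 is an assumption made for illustration; the count
    ignores any extra class token.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be multiples of patch_size")
    # One token per non-overlapping patch.
    num_tokens = (height // patch_size) * (width // patch_size)
    # Every token attends to every token: a num_tokens x num_tokens matrix.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for side in (224, 448, 896):
        tokens, entries = attention_cost(side, side)
        print(f"{side}x{side}: {tokens} tokens, {entries} attention entries")
```

Under these assumptions, doubling the image side quadruples the token count and multiplies the attention matrix size by sixteen, which is why high-resolution detection inputs are expensive for a pure transformer backbone.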

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Layer Normalization · Dense Connections