Toward Transformer-Based Object Detection
Josh Beal, Eric Kim, Eric Tzeng, Dong Huk Park, Andrew Zhai, Dmitry Kislyuk

TL;DR
This paper explores using Vision Transformers as a backbone for object detection, demonstrating competitive results and advantages over traditional methods, marking progress toward pure-transformer vision models.
Contribution
It introduces ViT-FRCNN, a novel transformer-based detection model that achieves competitive COCO results and shows improvements in out-of-domain performance and object size handling.
Findings
ViT-FRCNN achieves competitive COCO detection results.
Transformer-based models show improved out-of-domain performance.
Transformer-based detection relies less on non-maximum suppression.
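For readers unfamiliar with the post-processing step named in the last finding, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is a generic illustration, not the paper's implementation; the box format `(x1, y1, x2, y2)`, the function names `iou` and `nms`, and the 0.5 threshold are all assumptions made for this example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    The threshold value is an assumption chosen for illustration.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))
```

A detector that relies less on this step produces fewer near-duplicate boxes to begin with, so the suppression pass removes less.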
Abstract
Transformers have become the dominant model in natural language processing, owing to their ability to pretrain on massive amounts of data, then transfer to smaller, more specific tasks via fine-tuning. The Vision Transformer was the first major attempt to apply a pure transformer model directly to images as input, demonstrating that as compared to convolutional networks, transformer-based architectures can achieve competitive results on benchmark classification tasks. However, the computational complexity of the attention operator means that we are limited to low-resolution inputs. For more complex tasks such as detection or segmentation, maintaining a high input resolution is crucial to ensure that models can properly identify and reflect fine details in their output. This naturally raises the question of whether or not transformer-based architectures such as the Vision Transformer are…
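The resolution limit mentioned in the abstract follows from how self-attention scales: every patch token attends to every other token, so the attention matrix grows with the square of the token count. The sketch below works through that arithmetic. The 16-pixel patch size and the helper name `attention_cost` are assumptions for illustration, and the count ignores any extra class token.

```python
def attention_cost(height, width, patch_size=16):
    """Return (num_tokens, attention_entries) for one self-attention head.

    patch_size=16 is an assumed value; the class token is not counted.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    num_tokens = (height // patch_size) * (width // patch_size)
    # Each token attends to every token, so the matrix is num_tokens squared.
    return num_tokens, num_tokens ** 2


if __name__ == "__main__":
    for side in (224, 448, 896):
        tokens, entries = attention_cost(side, side)
        print(f"{side}x{side}: {tokens} tokens, {entries} attention entries")
```

Doubling the side length quadruples the number of tokens and multiplies the attention matrix size by 16, which is why detection-scale inputs are costly for a plain Vision Transformer.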
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Video Surveillance and Tracking Methods
Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Byte Pair Encoding · Label Smoothing · Dropout · Adam · Layer Normalization · Dense Connections
