Bottleneck Transformers for Visual Recognition

Aravind Srinivas; Tsung-Yi Lin; Niki Parmar; Jonathon Shlens; Pieter Abbeel; Ashish Vaswani

arXiv:2101.11605 · cs.CV · August 4, 2021

TL;DR

BoTNet makes one simple modification to ResNet: it replaces the spatial convolutions in the final three bottleneck blocks with global self-attention. This significantly improves instance segmentation and object detection over the baselines while reducing parameters, with minimal latency overhead.

Contribution

The paper demonstrates that integrating self-attention into ResNet bottleneck blocks improves performance on vision tasks, and shows that bottleneck blocks with self-attention can be viewed as Transformer blocks.
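To make "global self-attention over a feature map" concrete, here is a minimal sketch in plain Python. It is generic single-head scaled dot-product attention, not the paper's exact layer: the function names, the single head, the scaling by the square root of the projection size, and the omission of position encodings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def project(weight, vector):
    """Apply a pointwise (1x1) projection: weight is a d x C matrix."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def global_self_attention(feature_map, w_q, w_k, w_v):
    """Hypothetical helper: every spatial position attends to every other.

    feature_map is an H x W grid of C-dimensional vectors (nested lists).
    Returns an H x W grid of d-dimensional vectors.
    """
    height, width = len(feature_map), len(feature_map[0])
    # Flatten the grid into H*W tokens so attention is global.
    tokens = [feature_map[i][j] for i in range(height) for j in range(width)]
    queries = [project(w_q, t) for t in tokens]
    keys = [project(w_k, t) for t in tokens]
    values = [project(w_v, t) for t in tokens]
    dim = len(queries[0])

    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Weighted sum of all value vectors for this query position.
        outputs.append(
            [sum(w * v[c] for w, v in zip(weights, values)) for c in range(dim)]
        )
    # Restore the H x W spatial layout.
    return [outputs[i * width:(i + 1) * width] for i in range(height)]
```

Unlike a spatial convolution, whose output at a position depends only on a small neighbourhood, each output here mixes information from the whole grid, which is why the cost grows with the square of the number of positions.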

Findings

01. Achieves 44.4% Mask AP on COCO with Mask R-CNN.

02. Attains 84.7% top-1 accuracy on ImageNet.

03. Faster and more parameter-efficient than comparable models.

Abstract

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results…
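The claim that swapping spatial convolutions for self-attention reduces parameters can be checked with rough arithmetic. The sketch below assumes a 3x3 convolution with 512 input and output channels (the bottleneck width of the last stage of a standard ResNet-50) and an attention layer whose only weights are three 1x1 projections for queries, keys, and values. Biases, normalization, and any position-encoding parameters are ignored, so the numbers are an illustration rather than the paper's reported counts.

```python
def conv3x3_params(channels: int) -> int:
    """Weights in a 3x3 convolution with `channels` inputs and outputs."""
    return 3 * 3 * channels * channels


def attention_projection_params(channels: int) -> int:
    """Weights in three 1x1 projections (query, key, value).

    Assumption: no output projection and no position-encoding parameters.
    """
    return 3 * channels * channels


channels = 512        # assumed bottleneck width of the final ResNet-50 stage
replaced_blocks = 3   # the abstract replaces the final three bottleneck blocks

conv = conv3x3_params(channels)                    # 2,359,296
attn = attention_projection_params(channels)       # 786,432
saved = replaced_blocks * (conv - attn)            # 4,718,592

print(f"3x3 conv: {conv:,}  attention projections: {attn:,}  saved: {saved:,}")
```

Under these assumptions the attention projections need a third of the weights of the convolution they replace, which is consistent with the abstract's statement that the change reduces parameters.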

Taxonomy

Methods: Region Proposal Network · Batch Normalization · Split Attention · Max Pooling · 1x1 Convolution · Pointwise Convolution · ResNeSt · Attention Is All You Need · Residual Connection