ResT: An Efficient Transformer for Visual Recognition

Qinglong Zhang; Yubin Yang

arXiv:2105.13677 · cs.CV · October 15, 2021 · 148 cites

TL;DR

ResT introduces an efficient multi-scale vision Transformer with novel memory-efficient attention, flexible position encoding, and overlapping convolution patch embedding, achieving superior performance on image recognition tasks.

Contribution

It proposes a new Transformer backbone with memory-efficient attention, flexible position encoding, and overlapping convolution patch embedding, improving efficiency and accuracy.

Findings

01. Outperforms state-of-the-art backbones in image classification

02. Demonstrates strong results on downstream vision tasks

03. Shows efficiency gains over existing Transformer models

Abstract

This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT has several advantages: (1) a memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-heads dimension while preserving the diversity of the multiple heads; (2) position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the…
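To make point (1) concrete, the following is a rough sketch of how compressing keys and values with a strided depth-wise convolution shrinks the attention map. The function names, the 56 x 56 token grid, the reduction factor of 8, and the kernel/stride/padding choices are all illustrative assumptions; the abstract does not state them.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a strided convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_map_entries(height: int, width: int, reduction: int = 1) -> int:
    """Entries in one head's attention map for a height x width token grid.

    Queries keep all height * width tokens. Keys and values are assumed to be
    downsampled by a strided depth-wise convolution; the kernel, stride, and
    padding below are illustrative choices, not values from the abstract.
    """
    if reduction == 1:
        # Standard self-attention: keys/values cover the full grid.
        kv_h, kv_w = height, width
    else:
        # Assumed overlapping kernel: one larger than the stride.
        kernel, stride, padding = reduction + 1, reduction, reduction // 2
        kv_h = conv_out_size(height, kernel, stride, padding)
        kv_w = conv_out_size(width, kernel, stride, padding)
    # Attention map shape is (number of queries) x (number of keys).
    return (height * width) * (kv_h * kv_w)


full = attention_map_entries(56, 56)           # no compression
compressed = attention_map_entries(56, 56, 8)  # keys/values reduced 8x per axis
print(full, compressed, full // compressed)    # prints "9834496 153664 64"
```

Under these assumptions the attention map shrinks by the square of the per-axis reduction factor, which is where the memory saving would come from; the queries, and therefore the output resolution, are unchanged.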

Code & Models

Videos

ResT: An Efficient Transformer for Visual Recognition · SlidesLive

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Convolution · Dense Connections · Residual Connection · Layer Normalization