ResT: An Efficient Transformer for Visual Recognition
Qinglong Zhang, Yubin Yang

TL;DR
ResT introduces an efficient multi-scale vision Transformer with novel memory-efficient attention, flexible position encoding, and overlapping convolution patch embedding, achieving superior performance on image recognition tasks.
Contribution
It proposes a new Transformer backbone with memory-efficient attention, flexible position encoding, and overlapping convolution patch embedding, improving efficiency and accuracy.
Findings
Outperforms state-of-the-art backbones in image classification
Demonstrates strong results on downstream vision tasks
Shows efficiency gains over existing Transformer models
Abstract
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution, and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can tackle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the…
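Point (1) of the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function and parameter names (`depthwise_reduce`, `efficient_attention`, `params`, `mix`) are hypothetical, the depth-wise kernel size is assumed equal to the stride `s`, and any normalization layers are omitted. The idea shown is that keys and values come from a spatially reduced token grid, so the attention map has shape n × n/s² rather than n × n, and a small matrix mixes the attention logits across the heads dimension.

```python
import numpy as np

def depthwise_reduce(x, h, w, s, kernel):
    """Depth-wise strided convolution over a token grid (illustrative).

    x: (h*w, d) tokens; kernel: (s, s, d), one s-by-s filter per channel.
    Assumes kernel size == stride and that s divides h and w.
    Returns (h//s * w//s, d) tokens.
    """
    d = x.shape[1]
    grid = x.reshape(h // s, s, w // s, s, d)        # split the grid into s-by-s blocks
    out = np.einsum("isjtd,std->ijd", grid, kernel)  # per-channel weighted sum per block
    return out.reshape(-1, d)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)          # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def efficient_attention(x, h, w, heads, s, params):
    n, d = x.shape
    dh = d // heads
    q = x @ params["wq"]                                 # queries from all n tokens
    x_red = depthwise_reduce(x, h, w, s, params["dw"])   # n / s^2 tokens
    k = x_red @ params["wk"]                             # keys from reduced tokens
    v = x_red @ params["wv"]                             # values from reduced tokens
    # Split channels into heads: (heads, tokens, dh)
    q = q.reshape(n, heads, dh).transpose(1, 0, 2)
    k = k.reshape(-1, heads, dh).transpose(1, 0, 2)
    v = v.reshape(-1, heads, dh).transpose(1, 0, 2)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(dh)      # (heads, n, n / s^2)
    # Assumed form of the cross-head projection: a heads-by-heads mixing matrix
    logits = np.einsum("gh,hnm->gnm", params["mix"], logits)
    attn = softmax(logits, axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, d)    # merge heads back to (n, d)
    return out @ params["wo"], attn

# Toy example: an 8x8 token grid, 16 channels, 4 heads, reduction stride 2
rng = np.random.default_rng(0)
h = w = 8
d, heads, s = 16, 4, 2
params = {
    "wq": rng.standard_normal((d, d)) * 0.1,
    "wk": rng.standard_normal((d, d)) * 0.1,
    "wv": rng.standard_normal((d, d)) * 0.1,
    "wo": rng.standard_normal((d, d)) * 0.1,
    "dw": rng.standard_normal((s, s, d)) * 0.1,
    "mix": rng.standard_normal((heads, heads)) * 0.1,
}
x = rng.standard_normal((h * w, d))
y, attn = efficient_attention(x, h, w, heads, s, params)
print(y.shape, attn.shape)
```

With 64 input tokens and s = 2, each head attends over 16 reduced tokens instead of 64, so the attention map is four times smaller than in standard self-attention while the output keeps one vector per input token.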
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Convolution · Dense Connections · Residual Connection · Layer Normalization
