Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer

Zilong Huang; Youcheng Ben; Guozhong Luo; Pei Cheng; Gang Yu; Bin Fu

arXiv:2106.03650 · cs.CV · June 8, 2021 · 124 cites

TL;DR

This paper introduces Shuffle Transformer, a novel vision transformer that enhances cross-window connections using spatial shuffle and depth-wise convolution, leading to improved performance across various visual tasks.

Contribution

It proposes a simple, efficient modification to window-based transformers using spatial shuffle and depth-wise convolution to strengthen cross-window communication.

Findings

01. Achieves state-of-the-art results on image classification.

02. Improves object detection and semantic segmentation performance.

03. Simple to implement: per the abstract, it requires modifying only two lines of code.

Abstract

Very recently, Window-based Transformers, which computed self-attention within non-overlapping local windows, demonstrated promising results on image classification, semantic segmentation, and object detection. However, less study has been devoted to the cross-window connection which is the key element to improve the representation ability. In this work, we revisit the spatial shuffle as an efficient way to build connections among windows. As a result, we propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code. Furthermore, the depth-wise convolution is introduced to complement the spatial shuffle for enhancing neighbor-window connections. The proposed architectures achieve excellent performance on a wide range of visual tasks including image-level classification, object detection, and semantic…
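The spatial shuffle described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `window_partition`, the `shuffle` flag, and the use of a 2D grid of token ids are assumptions made here for illustration. The idea shown is that a regular window groups contiguous tokens, while a shuffled window groups tokens sampled at a fixed stride, so each attention window mixes tokens from distant regions. Only the reshape and transpose axes differ between the two modes.

```python
import numpy as np

def window_partition(x, ws, shuffle=False):
    """Group an (H, W) grid of tokens into windows of ws * ws tokens each.

    Hypothetical helper for illustration only.
    shuffle=False: each window is a contiguous ws x ws patch.
    shuffle=True:  each window takes tokens strided by H // ws and W // ws,
                   so it draws one token from each contiguous patch.
    Returns an array of shape (num_windows, ws * ws).
    """
    H, W = x.shape
    assert H % ws == 0 and W % ws == 0, "grid must be divisible by window size"
    nh, nw = H // ws, W // ws
    if shuffle:
        # Row index decomposes as i * nh + a (i: slot in window, a: window id).
        x = x.reshape(ws, nh, ws, nw).transpose(1, 3, 0, 2)
    else:
        # Row index decomposes as a * ws + i (a: window id, i: slot in window).
        x = x.reshape(nh, ws, nw, ws).transpose(0, 2, 1, 3)
    # Both branches now have shape (nh, nw, ws, ws).
    return x.reshape(nh * nw, ws * ws)

grid = np.arange(16).reshape(4, 4)        # token ids laid out on a 4 x 4 grid
print(window_partition(grid, 2)[0])                 # [0 1 4 5]
print(window_partition(grid, 2, shuffle=True)[0])   # [ 0  2  8 10]
```

In the shuffled case the first window holds tokens 0, 2, 8, and 10, one from each of the four contiguous 2 x 2 patches, which is how window-local attention can exchange information across windows. A real model would apply the inverse rearrangement after attention to restore the spatial layout; that step is omitted here.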

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Shuffle Transformer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Label Smoothing · Residual Connection