S$^2$-MLP: Spatial-Shift MLP Architecture for Vision
Tan Yu, Xu Li, Yunfeng Cai, Mingming Sun, Ping Li

TL;DR
S$^2$-MLP introduces a simple, parameter-free spatial-shift operation in an MLP architecture, achieving high accuracy on ImageNet with fewer parameters and FLOPs, outperforming previous MLP models and rivaling ViT.
Contribution
It proposes a novel spatial-shift MLP architecture that simplifies token communication, improving performance on medium-scale datasets compared to existing MLP models.
Findings
S$^2$-MLP outperforms MLP-Mixer on ImageNet-1K.
Achieves comparable accuracy to ViT with fewer FLOPs and parameters.
Parameter-free spatial-shift operation enhances efficiency and effectiveness.
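A spatial-shift operation of this kind can be sketched in a few lines of NumPy. The sketch below assumes a patch feature map of shape (H, W, C), splits the channels into four equal groups, and moves each group by one patch in a different direction; the function name, the group count, and the choice to leave border patches unchanged are illustrative assumptions, not details stated in this summary. Because it only copies values, it has no learnable parameters and costs no multiply-adds.

```python
import numpy as np

def spatial_shift(x):
    """Shift four channel groups by one patch in four directions.

    x: array of shape (H, W, C). Assumption: channels are split into
    four groups of size C // 4 (the last group absorbs any remainder).
    Border patches that have no neighbour to copy from keep their values.
    """
    h, w, c = x.shape
    g = c // 4
    out = x.copy()
    out[1:, :, :g] = x[:-1, :, :g]                # group 0: shift down along H
    out[:-1, :, g:2 * g] = x[1:, :, g:2 * g]      # group 1: shift up along H
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]  # group 2: shift right along W
    out[:, :-1, 3 * g:] = x[:, 1:, 3 * g:]        # group 3: shift left along W
    return out

# Usage: a 3x3 grid of patches with 4 channels each.
features = np.arange(3 * 3 * 4, dtype=float).reshape(3, 3, 4)
shifted = spatial_shift(features)
```

After the shift, each patch holds channels taken from its four neighbours, so a following channel-mixing MLP that only looks at one patch at a time still combines information from adjacent patches.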
Abstract
Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet-1K and ImageNet-21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global receptive field and spatial-specific…
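The difference between the two MLPs mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes a token matrix of shape (N, C) with N patches and C channels, and it omits layer normalization and skip connections for brevity; the function names, hidden sizes, and the tanh approximation of GELU are assumptions made for illustration, not details taken from this summary.

```python
import numpy as np

def gelu(x):
    # Tanh approximation of GELU (assumed activation for this sketch).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def mlp(x, w1, w2):
    # Two-layer MLP applied along the last axis of x.
    return gelu(x @ w1) @ w2

def channel_mixing(tokens, w1, w2):
    # tokens: (N, C). Each patch is processed independently, so no
    # information moves between patches.
    return mlp(tokens, w1, w2)

def token_mixing(tokens, w1, w2):
    # Transpose to (C, N) so the MLP acts along the patch axis. The same
    # weights are shared by every channel, and every output patch can
    # depend on every input patch.
    return mlp(tokens.T, w1, w2).T

rng = np.random.default_rng(0)
n_patches, n_channels, hidden = 6, 4, 8
tokens = rng.normal(size=(n_patches, n_channels))
tok_w1 = rng.normal(size=(n_patches, hidden))
tok_w2 = rng.normal(size=(hidden, n_patches))
ch_w1 = rng.normal(size=(n_channels, hidden))
ch_w2 = rng.normal(size=(hidden, n_channels))

mixed_tokens = token_mixing(tokens, tok_w1, tok_w2)      # shape (6, 4)
mixed_channels = channel_mixing(tokens, ch_w1, ch_w2)    # shape (6, 4)
```

Note that the token-mixing weights have a dimension tied to the number of patches N, which is one way to see why this layer is spatial-specific and global in reach.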
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Average Pooling · Global Average Pooling · MLP-Mixer · Byte Pair Encoding
