S$^2$-MLP: Spatial-Shift MLP Architecture for Vision

Tan Yu; Xu Li; Yunfeng Cai; Mingming Sun; Ping Li

arXiv:2106.07477 · cs.CV · June 24, 2021 · 29 citations

TL;DR

S$^2$-MLP introduces a simple, parameter-free spatial-shift operation into an MLP architecture. On ImageNet it outperforms previous MLP models and rivals ViT while using fewer parameters and FLOPs.

Contribution

The paper proposes a novel spatial-shift MLP architecture that simplifies communication between tokens and improves on existing MLP models when trained on medium-scale datasets.

Findings

01. S$^2$-MLP outperforms MLP-Mixer on ImageNet-1K.

02. S$^2$-MLP achieves accuracy comparable to ViT with fewer FLOPs and parameters.

03. The parameter-free spatial-shift operation improves both efficiency and effectiveness.
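
The parameter-free spatial-shift operation named in the findings can be illustrated with a minimal NumPy sketch. This summary does not spell out the mechanics, so the sketch rests on assumptions: the channels are split into four equal groups, each group is shifted by one patch in a different direction, and border patches keep their original values. The function name `spatial_shift` and the `(H, W, C)` layout are illustrative choices, not taken from the authors' code.

```python
import numpy as np


def spatial_shift(x: np.ndarray) -> np.ndarray:
    """Shift four channel groups by one patch in four directions.

    x has shape (H, W, C) with C divisible by 4 (assumed layout).
    The operation has no learnable weights; it only moves values.
    """
    h, w, c = x.shape
    if c % 4 != 0:
        raise ValueError("channel count must be divisible by 4")
    g = c // 4
    out = x.copy()  # border patches keep their original values
    out[1:, :, 0:g] = x[:-1, :, 0:g]                  # group 1: take the patch above
    out[:-1, :, g:2 * g] = x[1:, :, g:2 * g]          # group 2: take the patch below
    out[:, 1:, 2 * g:3 * g] = x[:, :-1, 2 * g:3 * g]  # group 3: take the patch to the left
    out[:, :-1, 3 * g:] = x[:, 1:, 3 * g:]            # group 4: take the patch to the right
    return out


# Tiny example: a 3x3 grid of patches with 4 channels (one per group).
x = np.arange(3 * 3 * 4).reshape(3, 3, 4)
y = spatial_shift(x)
```

After the shift, the feature vector at each interior patch holds channels taken from its four neighbours, so a following per-patch (channel-mixing) MLP can combine information from adjacent patches without any extra parameters.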

Abstract

Recently, Vision Transformer (ViT) and its follow-up works have abandoned convolution in favor of the self-attention operation, attaining accuracy comparable to or even higher than that of CNNs. More recently, MLP-Mixer abandoned both convolution and self-attention, proposing an architecture that contains only MLP layers. To achieve cross-patch communication, it adds a token-mixing MLP alongside the channel-mixing MLP. It achieves promising results when trained on an extremely large-scale dataset, but it does not match its CNN and ViT counterparts when trained on medium-scale datasets such as ImageNet-1K and ImageNet-21K. This performance drop motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global receptive field and spatial-specific…
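
The abstract's reading of token mixing can be made concrete with a small sketch. As a simplifying assumption, the token-mixing MLP is reduced here to a single linear layer (MLP-Mixer's version stacks two layers with a nonlinearity), and the names `token_mixing_linear`, `n_patches`, and `channels` are illustrative. The sketch shows two properties: channels are never mixed with one another, as in a depthwise convolution (here the same weights are reused for every channel), and every output patch depends on every input patch, which is the global receptive field.

```python
import numpy as np


def token_mixing_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Mix information across patches with one linear layer.

    x: (N, C) patch features, w: (N, N) mixing weights.
    Row i of w holds the weights for output patch i, so the weights
    depend on spatial position.
    """
    return w @ x


rng = np.random.default_rng(0)
n_patches, channels = 16, 8
x = rng.standard_normal((n_patches, channels))
w = rng.standard_normal((n_patches, n_patches))
y = token_mixing_linear(x, w)

# Depthwise-like behaviour: each channel is mixed on its own,
# using the same (N, N) weights for every channel.
y_per_channel = np.stack([w @ x[:, k] for k in range(channels)], axis=1)

# Global receptive field: perturbing patch 0 changes every output patch
# by the corresponding entry in column 0 of w.
x_perturbed = x.copy()
x_perturbed[0] += 1.0
delta = token_mixing_linear(x_perturbed, w) - y
```

Because `w` has one row per output patch, its size grows with the square of the number of patches, which is one way to see why this form of mixing carries many position-specific parameters.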

Code & Models

Repositories

dslisleedh/MLP_based_models-tensorflow2/blob/master/s2mlp.py (TensorFlow)

Videos

S2-MLP: Spatial-Shift MLP Architecture for Vision · YouTube

Taxonomy

Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Average Pooling · Global Average Pooling · MLP-Mixer · Byte Pair Encoding