S$^2$-MLPv2: Improved Spatial-Shift MLP Architecture for Vision

Tan Yu; Xu Li; Yunfeng Cai; Mingming Sun; Ping Li

arXiv:2108.01072 · cs.CV · August 3, 2021 · 30 citations

TL;DR

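S$^2$-MLPv2 improves the spatial-shift MLP backbone for vision by expanding the feature map along the channel dimension, applying different spatial-shift operations to the split parts, fusing them with split-attention, and adopting a pyramid structure. It reaches 83.6% top-1 accuracy on ImageNet-1K without self-attention.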

Contribution

The paper introduces S$^2$-MLPv2, an improved vision backbone that incorporates feature expansion, split-attention, and pyramid structures for better image recognition performance.

Findings

01 Achieves 83.6% top-1 accuracy on ImageNet-1K.

02 Outperforms previous MLP-based models without self-attention.

03 Uses 55M parameters for a competitive medium-scale model.

Abstract

Recently, MLP-based vision backbones have emerged. With less inductive bias than CNNs and vision Transformers, MLP-based vision architectures achieve competitive performance in image recognition. Among them, spatial-shift MLP (S$^2$-MLP), which adopts a straightforward spatial-shift operation, outperforms pioneering works including MLP-Mixer and ResMLP. More recently, by using smaller patches with a pyramid structure, Vision Permutator (ViP) and Global Filter Network (GFNet) have outperformed S$^2$-MLP. In this paper, we improve the S$^2$-MLP vision backbone. We expand the feature map along the channel dimension and split the expanded feature map into several parts. We conduct different spatial-shift operations on the split parts. Meanwhile, we exploit the split-attention operation to fuse these split parts. Moreover, like these counterparts, we adopt…
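The expand, split, shift, and fuse steps described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the three-way split, the particular shift directions, the unshifted third branch, the ReLU in the attention MLP, and all names (`shift`, `split_attention`, `s2v2_block_sketch`, `w_expand`, `w1`, `w2`) are assumptions made for illustration.

```python
import numpy as np

def shift(x, order):
    """Shift four equal channel groups of x (H, W, C) by one pixel each.

    `order` names the direction used for each group. Border rows/columns
    keep their original values; channels beyond 4 * (C // 4) stay unshifted.
    """
    out = x.copy()
    g = x.shape[2] // 4
    for i, direction in enumerate(order):
        s = slice(i * g, (i + 1) * g)
        if direction == "down":
            out[1:, :, s] = x[:-1, :, s]
        elif direction == "up":
            out[:-1, :, s] = x[1:, :, s]
        elif direction == "right":
            out[:, 1:, s] = x[:, :-1, s]
        elif direction == "left":
            out[:, :-1, s] = x[:, 1:, s]
    return out

def split_attention(parts, w1, w2):
    """Fuse k parts of shape (k, H, W, C) with per-channel softmax weights."""
    k, _, _, c = parts.shape
    a = parts.sum(axis=(0, 1, 2))              # global summary, shape (C,)
    hidden = np.maximum(a @ w1, 0.0)           # assumed ReLU non-linearity
    logits = (hidden @ w2).reshape(k, c)       # one logit per part and channel
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)   # softmax over the k parts
    return (weights[:, None, None, :] * parts).sum(axis=0)

def s2v2_block_sketch(x, w_expand, w1, w2):
    """Expand C -> 3C, split into three parts, shift two of them, then fuse."""
    expanded = x @ w_expand                    # (H, W, 3C)
    p1, p2, p3 = np.split(expanded, 3, axis=-1)
    p1 = shift(p1, ("down", "up", "right", "left"))
    p2 = shift(p2, ("right", "left", "down", "up"))  # assumed different order
    return split_attention(np.stack([p1, p2, p3]), w1, w2)
```

For an input of shape `(H, W, C)`, this sketch expects `w_expand` of shape `(C, 3C)`, `w1` of shape `(C, D)` for some hidden size `D`, and `w2` of shape `(D, 3C)`; the output has the same shape as the input.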

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging · Advanced Neural Network Applications

Methods: Feedforward Network · Affine Operator · Residual Multi-Layer Perceptrons · Average Pooling · Dropout · Global Average Pooling · Layer Normalization · Dense Connections · Residual Connection