MetaFormer Is Actually What You Need for Vision

Weihao Yu; Mi Luo; Pan Zhou; Chenyang Si; Yichen Zhou; Xinchao Wang; Jiashi Feng; Shuicheng Yan

arXiv:2111.11418·cs.CV·July 5, 2022·29 cites

TL;DR

This paper introduces MetaFormer, a general architecture for vision tasks that emphasizes the importance of the overall structure over specific token mixers, demonstrated by a simple PoolFormer model achieving competitive results.

Contribution

The paper proposes MetaFormer as a new perspective, showing that the architecture itself is more crucial than the specific token mixer, and introduces PoolFormer as a simple yet effective baseline.
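The idea that the surrounding architecture matters more than the token mixer can be illustrated with a block whose mixer is a plug-in argument. The following is a minimal sketch, not the authors' implementation: it assumes the usual Transformer-style residual layout (normalize, mix tokens, add; normalize, per-token channel MLP, add), and every name in it (`metaformer_block`, `layer_norm`, `channel_mlp`, `mean_mixer`, `zero_mixer`) is hypothetical. The channel MLP is replaced by a fixed ReLU so the example has no learned weights.

```python
from typing import Callable, List

Tokens = List[List[float]]  # N tokens, each a list of C channel values


def layer_norm(tokens: Tokens, eps: float = 1e-5) -> Tokens:
    """Normalize each token to zero mean and unit variance (no learned scale or shift)."""
    out = []
    for tok in tokens:
        mean = sum(tok) / len(tok)
        var = sum((v - mean) ** 2 for v in tok) / len(tok)
        out.append([(v - mean) / (var + eps) ** 0.5 for v in tok])
    return out


def channel_mlp(tokens: Tokens) -> Tokens:
    """Stand-in for the per-token channel MLP: a fixed ReLU (assumption, no weights)."""
    return [[max(0.0, v) for v in tok] for tok in tokens]


def add(a: Tokens, b: Tokens) -> Tokens:
    """Element-wise sum, used for the residual connections."""
    return [[x + y for x, y in zip(ta, tb)] for ta, tb in zip(a, b)]


def metaformer_block(tokens: Tokens, token_mixer: Callable[[Tokens], Tokens]) -> Tokens:
    """Generic block: the token mixer is a parameter, the rest of the layout is fixed."""
    y = add(tokens, token_mixer(layer_norm(tokens)))  # token-mixing sub-block
    return add(y, channel_mlp(layer_norm(y)))         # channel-MLP sub-block


def mean_mixer(tokens: Tokens) -> Tokens:
    """Very simple mixer: every token receives the average of all tokens."""
    n = len(tokens)
    avg = [sum(col) / n for col in zip(*tokens)]
    return [list(avg) for _ in tokens]


def zero_mixer(tokens: Tokens) -> Tokens:
    """Degenerate mixer that exchanges no information between tokens."""
    return [[0.0] * len(tok) for tok in tokens]
```

Calling `metaformer_block(x, mean_mixer)` and `metaformer_block(x, zero_mixer)` on the same input runs the identical block layout with two different mixers; an attention or pooling function with the same signature could be passed in the same way.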

Findings

01. PoolFormer achieves 82.1% top-1 accuracy on ImageNet-1K.

02. PoolFormer outperforms some Transformer/MLP baselines with fewer parameters and computations.

03. MetaFormer architecture is key to recent high-performance vision models.

Abstract

Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned…
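To make "spatial pooling as a token mixer" concrete, here is a minimal sketch of average pooling over a single-channel feature map. It is an illustration under stated assumptions rather than the paper's code: the 3 x 3 window, the stride of 1, and the choice to ignore out-of-bounds positions instead of zero-padding are all assumptions, and the name `avg_pool_mixer` is hypothetical.

```python
from typing import List

Grid = List[List[float]]  # H x W feature map for a single channel


def avg_pool_mixer(grid: Grid, size: int = 3) -> Grid:
    """Replace each position with the mean of its size x size neighborhood.

    Stride is 1, so the output has the same shape as the input.
    Positions outside the grid are ignored rather than zero-padded (assumption).
    """
    h, w = len(grid), len(grid[0])
    r = size // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            # Collect the in-bounds neighbors around (i, j).
            vals = [
                grid[a][b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


grid = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
pooled = avg_pool_mixer(grid)
print(pooled[1][1])  # center: mean of all nine values -> 5.0
print(pooled[0][0])  # corner: mean of 1, 2, 4, 5 -> 3.0
```

The operator has no learnable parameters: each token only sees a local average of its neighbors, which is the sense in which the mixing is "basic" compared with attention or spatial MLPs.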

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques

Methods: Attention Is All You Need · MetaFormer · PoolFormer · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings