MetaFormer Is Actually What You Need for Vision
Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, Shuicheng Yan

TL;DR
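This paper introduces MetaFormer, a general architecture abstracted from Transformers in which the token mixer is left unspecified. It argues that this overall structure matters more to performance than the specific token mixer, and supports the claim with PoolFormer, a model that mixes tokens with simple spatial pooling and still achieves competitive results.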
Contribution
The paper proposes MetaFormer as a way of viewing Transformer-like models, shows that the architecture itself is more crucial than the specific token mixer, and introduces PoolFormer as a simple yet effective baseline.
Findings
PoolFormer achieves 82.1% top-1 accuracy on ImageNet-1K.
PoolFormer outperforms some Transformer and MLP-like baselines while using fewer parameters and less computation.
The MetaFormer architecture, rather than any particular token mixer, is key to recent high-performance vision models.
Abstract
Transformers have shown great potential in computer vision tasks. A common belief is that their attention-based token mixer module contributes most to their competence. However, recent works show that the attention-based module in Transformers can be replaced by spatial MLPs and the resulting models still perform quite well. Based on this observation, we hypothesize that the general architecture of the Transformers, instead of the specific token mixer module, is more essential to the model's performance. To verify this, we deliberately replace the attention module in Transformers with an embarrassingly simple spatial pooling operator to conduct only basic token mixing. Surprisingly, we observe that the derived model, termed PoolFormer, achieves competitive performance on multiple computer vision tasks. For example, on ImageNet-1K, PoolFormer achieves 82.1% top-1 accuracy, surpassing well-tuned…
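The idea in the abstract, keeping the block structure fixed and swapping the token mixer for spatial pooling, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (layer_norm, pool_mixer, channel_mlp, metaformer_block) are hypothetical, and several details are assumptions made for brevity: a 3x3 pooling window, border windows averaged over valid positions only, per-token normalization without learnable parameters, a ReLU channel MLP without biases, and subtracting the input inside the mixer because the residual connection adds it back.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Normalize each token over its channel axis (assumption: no learnable scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def pool_mixer(x, pool_size=3):
    # x has shape (H, W, C). Stride-1 average pooling with "same" output size;
    # windows at the border average only over positions inside the grid.
    h, w, _ = x.shape
    r = pool_size // 2
    pooled = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            window = x[max(i - r, 0):i + r + 1, max(j - r, 0):j + r + 1]
            pooled[i, j] = window.mean(axis=(0, 1))
    # Subtract the input so that the block's residual connection does not count it twice.
    return pooled - x


def channel_mlp(x, w1, w2):
    # Two-layer MLP applied to every token independently (assumption: ReLU, no biases).
    hidden = np.maximum(x @ w1, 0.0)
    return hidden @ w2


def metaformer_block(x, token_mixer, w1, w2):
    # Generic block: the token mixer is a plug-in argument, everything else stays fixed.
    x = x + token_mixer(layer_norm(x))          # sub-block 1: mix information across tokens
    x = x + channel_mlp(layer_norm(x), w1, w2)  # sub-block 2: mix information across channels
    return x


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 4, 8))        # a 4x4 grid of tokens with 8 channels
w1 = rng.standard_normal((8, 32)) * 0.1   # hypothetical MLP weights, expansion ratio 4
w2 = rng.standard_normal((32, 8)) * 0.1
y = metaformer_block(x, pool_mixer, w1, w2)
print(y.shape)  # (4, 4, 8)
```

Passing a different callable as token_mixer (for example an attention function or a spatial MLP) leaves the rest of the block unchanged, which is the sense in which the abstract separates the general architecture from the specific token mixer.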
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · MetaFormer · PoolFormer · Linear Layer · Softmax · Multi-Head Attention · Residual Connection · Dense Connections · Byte Pair Encoding · Absolute Position Encodings
