MetaFormer Baselines for Vision

Weihao Yu; Chenyang Si; Pan Zhou; Mi Luo; Yichen Zhou; Jiashi Feng; Shuicheng Yan; Xinchao Wang

arXiv:2210.13452 · cs.CV · December 3, 2024 · 33 cites

TL;DR

MetaFormer models built on simple, conventional token mixers achieve state-of-the-art performance on vision tasks, establishing strong baselines and showing the architecture's robustness and flexibility.

Contribution

The paper introduces baseline MetaFormer models with various token mixers, showing their effectiveness and setting new records on ImageNet-1K without complex designs.

Findings

01. IdentityFormer achieves >80% accuracy on ImageNet-1K using only an identity mapping as the token mixer.

02. RandFormer, which mixes tokens with a random matrix, reaches >81% accuracy and outperforms IdentityFormer.

03. CAFormer sets a new record of 85.5% accuracy on ImageNet-1K.

Abstract

MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers…
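To make the first two observations concrete, here is a minimal NumPy sketch of a MetaFormer-style block in which the token mixer is a pluggable function. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`layer_norm`, `identity_mixer`, `make_random_mixer`, `make_mlp`, `metaformer_block`), the softmax normalization of the random matrix, the ReLU activation, and all sizes are hypothetical choices made for the example.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize each token (row) over its channel dimension."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def identity_mixer(x):
    """Identity token mixer: tokens are not mixed at all."""
    return x


def make_random_mixer(num_tokens, rng):
    """Random token mixer: a fixed random (N, N) matrix applied across tokens.

    Assumption: rows are softmax-normalized so each output token is a
    convex combination of input tokens. The matrix is never trained.
    """
    w = rng.standard_normal((num_tokens, num_tokens))
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)

    def mixer(x):
        return w @ x  # (N, N) @ (N, C) -> (N, C)

    return mixer


def make_mlp(dim, hidden, rng):
    """Two-layer channel MLP. ReLU is used here only for simplicity."""
    w1 = rng.standard_normal((dim, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, dim)) * 0.02

    def mlp(x):
        return np.maximum(x @ w1, 0.0) @ w2

    return mlp


def metaformer_block(x, token_mixer, channel_mlp):
    """Residual token-mixing sub-block followed by a residual channel MLP."""
    x = x + token_mixer(layer_norm(x))
    x = x + channel_mlp(layer_norm(x))
    return x


# Usage: the same block skeleton with two different token mixers.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))  # 4 tokens, 8 channels (arbitrary sizes)
mlp = make_mlp(dim=8, hidden=16, rng=rng)
out_identity = metaformer_block(tokens, identity_mixer, mlp)
out_random = metaformer_block(tokens, make_random_mixer(4, rng), mlp)
```

Only the `token_mixer` argument changes between the two calls. The rest of the block (normalization, residual connections, channel MLP) is shared, which is the part of the architecture the abstract credits for the solid lower bound.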

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · PoolFormer · MetaFormer · StarReLU · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer