MetaFormer Baselines for Vision
Weihao Yu, Chenyang Si, Pan Zhou, Mi Luo, Yichen Zhou, Jiashi Feng, Shuicheng Yan, Xinchao Wang

TL;DR
The paper shows that MetaFormer models built with simple, conventional token mixers can reach state-of-the-art accuracy on ImageNet-1K, establishing strong baselines and indicating that the architecture is robust to the choice of token mixer.
Contribution
The paper introduces baseline MetaFormer models with various token mixers, showing their effectiveness and setting new records on ImageNet-1K without complex designs.
Findings
IdentityFormer achieves >80% accuracy on ImageNet-1K with an identity mapping as the token mixer.
RandFormer outperforms IdentityFormer, reaching >81% accuracy with a random matrix as the token mixer.
CAFormer sets a new record with 85.5% accuracy on ImageNet-1K.
Abstract
MetaFormer, the abstracted architecture of Transformer, has been found to play a significant role in achieving competitive performance. In this paper, we further explore the capacity of MetaFormer, again, without focusing on token mixer design: we introduce several baseline models under MetaFormer using the most basic or common mixers, and summarize our observations as follows: (1) MetaFormer ensures solid lower bound of performance. By merely adopting identity mapping as the token mixer, the MetaFormer model, termed IdentityFormer, achieves >80% accuracy on ImageNet-1K. (2) MetaFormer works well with arbitrary token mixers. When specifying the token mixer as even a random matrix to mix tokens, the resulting model RandFormer yields an accuracy of >81%, outperforming IdentityFormer. Rest assured of MetaFormer's results when new token mixers are adopted. (3) MetaFormer effortlessly offers…
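The first two observations in the abstract can be illustrated with a minimal NumPy sketch of a MetaFormer-style block, in which the token mixer is a swappable function. This is an illustration under stated assumptions, not the authors' implementation: the function names (`metaformer_block`, `identity_mixer`, `make_random_mixer`, `make_mlp`) are hypothetical, the random mixer is assumed to be a frozen random matrix normalized row-wise with softmax, and plain ReLU stands in for whatever activation the real models use.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token over its channel dimension (no learned scale/shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def identity_mixer(x):
    # IdentityFormer-style mixer: tokens are passed through unchanged.
    return x

def make_random_mixer(num_tokens, seed=0):
    # Assumption: a frozen random (num_tokens x num_tokens) matrix,
    # softmax-normalized per row, mixes information across tokens.
    rng = np.random.default_rng(seed)
    w = softmax(rng.standard_normal((num_tokens, num_tokens)), axis=-1)
    def mixer(x):
        return w @ x  # x has shape (num_tokens, dim)
    return mixer

def make_mlp(dim, hidden, seed=1):
    # Two-layer channel MLP applied to every token independently.
    # ReLU is a simplification chosen for this sketch.
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((dim, hidden)) * 0.02
    w2 = rng.standard_normal((hidden, dim)) * 0.02
    def mlp(x):
        return np.maximum(x @ w1, 0.0) @ w2
    return mlp

def metaformer_block(x, token_mixer, channel_mlp):
    # Generic block: norm -> token mixer -> residual, norm -> MLP -> residual.
    x = x + token_mixer(layer_norm(x))
    x = x + channel_mlp(layer_norm(x))
    return x

if __name__ == "__main__":
    tokens = np.random.default_rng(42).standard_normal((16, 8))
    mlp = make_mlp(dim=8, hidden=32)
    print(metaformer_block(tokens, identity_mixer, mlp).shape)
    print(metaformer_block(tokens, make_random_mixer(16), mlp).shape)
```

In this sketch only the `token_mixer` argument changes between the two variants. With the identity mixer no information flows between tokens inside the block, whereas the random mixer lets every token influence every other token.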
Code & Models
- 🤗 timm/caformer_b36.sail_in1k (model, 378 downloads)
- 🤗 timm/caformer_b36.sail_in1k_384 (model, 149 downloads)
- 🤗 timm/caformer_b36.sail_in22k (model, 170 downloads)
- 🤗 timm/caformer_b36.sail_in22k_ft_in1k (model, 2.3k downloads, 2 likes)
- 🤗 timm/caformer_b36.sail_in22k_ft_in1k_384 (model, 2.0k downloads)
- 🤗 timm/caformer_m36.sail_in1k (model, 457 downloads)
- 🤗 timm/caformer_m36.sail_in1k_384 (model, 145 downloads)
- 🤗 timm/caformer_m36.sail_in22k (model, 148 downloads)
- 🤗 timm/caformer_m36.sail_in22k_ft_in1k (model, 203 downloads)
- 🤗 timm/caformer_m36.sail_in22k_ft_in1k_384 (model, 1.2k downloads, 2 likes)
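The checkpoint names above appear to follow the timm convention of `<architecture>.<pretrained_tag>` under the `timm` organization on the Hugging Face Hub. The sketch below splits such a name into its parts and shows one way a checkpoint might be loaded. The helper names `split_checkpoint_name` and `load_checkpoint` are hypothetical; `timm.create_model` with an `hf_hub:` prefix is the library call assumed here, and it needs `timm`, PyTorch, and network access when `pretrained=True`.

```python
def split_checkpoint_name(name):
    # "timm/caformer_b36.sail_in22k_ft_in1k" ->
    #   org "timm", architecture "caformer_b36", tag "sail_in22k_ft_in1k"
    org, _, model = name.partition("/")
    architecture, _, tag = model.partition(".")
    return {"org": org, "architecture": architecture, "pretrained_tag": tag}

def load_checkpoint(name, pretrained=True):
    # Imported lazily so the name helper works without timm installed.
    import timm
    model = timm.create_model(f"hf_hub:{name}", pretrained=pretrained)
    return model.eval()

if __name__ == "__main__":
    print(split_checkpoint_name("timm/caformer_b36.sail_in22k_ft_in1k"))
```

Calling `load_checkpoint("timm/caformer_b36.sail_in22k_ft_in1k")` would download the weights on first use, so it is left out of the demo section.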
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · ConvNeXt · PoolFormer · MetaFormer · StarReLU · Linear Layer · Dense Connections · Softmax · Position-Wise Feed-Forward Layer
