MaxViT: Multi-Axis Vision Transformer
Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, Yinxiao Li

TL;DR
MaxViT introduces a scalable multi-axis attention mechanism combining local and global attention, enabling efficient vision transformers that excel in classification, detection, and generative tasks.
Contribution
The paper proposes a novel multi-axis attention model integrated into a hierarchical vision backbone, MaxViT, which achieves state-of-the-art results across multiple vision tasks.
Findings
MaxViT achieves 86.5% ImageNet-1K top-1 accuracy without extra data.
With ImageNet-21K pre-training, MaxViT reaches 88.7% accuracy.
MaxViT demonstrates strong performance in object detection and generative modeling.
Abstract
Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier,…
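The two components named in the abstract, blocked local attention and dilated global attention, can be illustrated by looking at which pixels are allowed to attend to each other. The sketch below is not the authors' implementation: the function names, the fixed window size, and the fixed grid size are assumptions made for illustration, and the feature map is assumed to be evenly divisible by both. It only groups pixel coordinates and counts query-key pairs, which is enough to see why the cost grows linearly with the number of pixels when the window and grid sizes are held constant.

```python
from collections import defaultdict


def block_partition(height, width, window):
    """Blocked local attention (illustrative): group pixels into
    non-overlapping window x window tiles. Attention runs inside each tile."""
    assert height % window == 0 and width % window == 0
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i // window, j // window)].append((i, j))
    return list(groups.values())


def grid_partition(height, width, grid):
    """Dilated global attention (illustrative): overlay a grid x grid lattice
    on the map. Each group holds grid * grid pixels spaced height // grid rows
    and width // grid columns apart, so it spans the whole image sparsely."""
    assert height % grid == 0 and width % grid == 0
    stride_h, stride_w = height // grid, width // grid
    groups = defaultdict(list)
    for i in range(height):
        for j in range(width):
            groups[(i % stride_h, j % stride_w)].append((i, j))
    return list(groups.values())


def attention_pairs(groups):
    """Count query-key pairs when attention is computed within each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    for size in (8, 16, 32):
        local = attention_pairs(block_partition(size, size, window=4))
        dilated = attention_pairs(grid_partition(size, size, grid=4))
        full = (size * size) ** 2  # unrestricted self-attention
        print(size, local, dilated, full)
```

With the window and grid both fixed at 4, every group contains 16 pixels regardless of image size, so the number of pairs is 16 times the pixel count for each of the two partitions, while unrestricted self-attention scales with the square of the pixel count.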
Code & Models
- 🤗 timm/maxvit_base_tf_224.in1k · model · 2.1k dl · ♡ 1
- 🤗 timm/maxvit_base_tf_384.in1k · model · 831 dl · ♡ 1
- 🤗 timm/maxvit_base_tf_384.in21k_ft_in1k · model · 521 dl
- 🤗 timm/maxvit_base_tf_512.in1k · model · 2.8k dl
- 🤗 timm/maxvit_base_tf_512.in21k_ft_in1k · model · 757 dl · ♡ 1
- 🤗 timm/maxvit_large_tf_224.in1k · model · 289 dl · ♡ 1
- 🤗 timm/maxvit_large_tf_384.in1k · model · 376 dl
- 🤗 timm/maxvit_large_tf_384.in21k_ft_in1k · model · 286 dl
- 🤗 timm/maxvit_large_tf_512.in1k · model · 955 dl · ♡ 1
- 🤗 timm/maxvit_large_tf_512.in21k_ft_in1k · model · 312 dl
Taxonomy
Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques
