MaxViT: Multi-Axis Vision Transformer

Zhengzhong Tu; Hossein Talebi; Han Zhang; Feng Yang; Peyman Milanfar; Alan Bovik; Yinxiao Li

arXiv:2204.01697 · cs.CV · September 12, 2022 · 26 cites

TL;DR

MaxViT introduces a scalable multi-axis attention mechanism combining local and global attention, enabling efficient vision transformers that excel in classification, detection, and generative tasks.

Contribution

The paper proposes a novel multi-axis attention model integrated into a hierarchical vision backbone, MaxViT, which achieves state-of-the-art results across multiple vision tasks.

Findings

01. MaxViT achieves 86.5% ImageNet-1K top-1 accuracy without extra data.
02. With ImageNet-21K pre-training, MaxViT reaches 88.7% top-1 accuracy.
03. MaxViT demonstrates strong performance in object detection and generative modeling.

Abstract

Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, by simply repeating the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in earlier,…
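
The two aspects of multi-axis attention named in the abstract can be illustrated by looking only at which token positions end up in the same attention group. The sketch below is not the authors' implementation. It is a minimal pure-Python illustration that assumes the feature map divides evenly by the window and grid sizes, and the function names (`block_groups`, `grid_groups`, `attention_pairs`) are hypothetical. Blocked local attention groups tokens inside non-overlapping windows, while dilated global attention groups tokens that are spread across the whole map at a fixed stride.

```python
def block_groups(height, width, window):
    """Blocked local attention: group coordinates by non-overlapping window."""
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    groups = {}
    for i in range(height):
        for j in range(width):
            # Tokens in the same window x window tile share a group.
            groups.setdefault((i // window, j // window), []).append((i, j))
    return groups


def grid_groups(height, width, grid):
    """Dilated global attention: group coordinates on a sparse grid x grid lattice."""
    if height % grid or width % grid:
        raise ValueError("height and width must be divisible by grid")
    stride_h, stride_w = height // grid, width // grid
    groups = {}
    for i in range(height):
        for j in range(width):
            # Tokens with the same offset inside each stride cell share a group,
            # so one group spans the whole map with dilation (stride_h, stride_w).
            groups.setdefault((i % stride_h, j % stride_w), []).append((i, j))
    return groups


def attention_pairs(groups):
    """Count query-key pairs when attention runs independently inside each group."""
    return sum(len(members) ** 2 for members in groups.values())


if __name__ == "__main__":
    local = block_groups(8, 8, window=4)
    sparse = grid_groups(8, 8, grid=4)
    print(local[(0, 0)][:4])   # neighbouring positions from one window
    print(sparse[(0, 0)][:4])  # positions two columns apart (stride 2)
    print(attention_pairs(local), attention_pairs(sparse), (8 * 8) ** 2)
```

With a fixed window or grid size P, every token attends to P² others, so the pair count is H·W·P², which grows linearly in the number of tokens. Full self-attention instead needs (H·W)² pairs. That is the sense in which the abstract's "linear complexity" claim can be read under these assumptions.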

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques