MViTv2: Improved Multiscale Vision Transformers for Classification and Detection
Yanghao Li, Chao-Yuan Wu, Haoqi Fan, Karttikeya Mangalam, Bo Xiong, Jitendra Malik, Christoph Feichtenhofer

TL;DR
MViTv2 introduces an improved multiscale vision transformer architecture with decomposed relative positional embeddings and residual pooling, achieving state-of-the-art results across image classification, object detection, and video recognition tasks.
Contribution
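The paper presents MViTv2, a unified multiscale vision transformer architecture whose two enhancements, decomposed relative positional embeddings and residual pooling connections, let it outperform previous models on image classification, object detection, and video recognition.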
Findings
88.8% accuracy on ImageNet classification
58.7 boxAP on COCO detection
86.1% accuracy on Kinetics-400 video recognition
Abstract
In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
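The decomposed relative positional embeddings named in the abstract can be illustrated with a small sketch. It assumes the formulation in which the embedding between two tokens is the sum of one learned embedding per axis, indexed by the relative offset along that axis. The function names, the random initialization, and the simplification that queries and keys share one height x width grid (no pooling, no temporal axis) are all assumptions made for illustration; this is not the authors' implementation.

```python
import random


def decomposed_rel_pos(height, width, dim, seed=0):
    """Hypothetical helper: per-axis relative embedding tables plus a lookup."""
    rng = random.Random(seed)
    # A relative offset along an axis of length n lies in [-(n-1), n-1],
    # so each axis needs 2n - 1 embedding vectors.
    rel_h = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(2 * height - 1)]
    rel_w = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
             for _ in range(2 * width - 1)]

    def lookup(i, j):
        # Tokens are indexed row-major over the height x width grid.
        hi, wi = divmod(i, width)
        hj, wj = divmod(j, width)
        vec_h = rel_h[hi - hj + height - 1]
        vec_w = rel_w[wi - wj + width - 1]
        # Decomposition: the embedding for the pair (i, j) is the sum of
        # the height-axis and width-axis embeddings.
        return [a + b for a, b in zip(vec_h, vec_w)]

    return rel_h, rel_w, lookup


def table_sizes(height, width):
    """Number of embedding vectors: joint 2D table vs. decomposed tables."""
    joint = (2 * height - 1) * (2 * width - 1)
    decomposed = (2 * height - 1) + (2 * width - 1)
    return joint, decomposed


if __name__ == "__main__":
    joint, decomposed = table_sizes(14, 14)
    print(f"joint table: {joint} vectors, decomposed: {decomposed} vectors")
```

The point of the decomposition is the table size: a joint table grows with the product of the axis lengths, while the decomposed tables grow with their sum. The residual pooling connection from the abstract is not shown here.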
Code & Models
- 🤗 kadirnar/timm_model_list · model · ♡ 1
- 🤗 timm/mvitv2_base.fb_in1k · model · 697 dl · ♡ 1
- 🤗 timm/mvitv2_base_cls.fb_inw21k · model · 372 dl · ♡ 1
- 🤗 timm/mvitv2_huge_cls.fb_inw21k · model · 35 dl
- 🤗 timm/mvitv2_large.fb_in1k · model · 82 dl · ♡ 1
- 🤗 timm/mvitv2_large_cls.fb_inw21k · model · 45 dl · ♡ 3
- 🤗 timm/mvitv2_small.fb_in1k · model · 666 dl
- 🤗 timm/mvitv2_tiny.fb_in1k · model · 332 dl · ♡ 1
- 🤗 birder-project/mvit_v2_t_il-all · model · 27 dl
- 🤗 birder-project/mvit_v2_s_yellowstone · model · 21 dl
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multiscale Vision Transformer
