MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Yanghao Li; Chao-Yuan Wu; Haoqi Fan; Karttikeya Mangalam; Bo Xiong; Jitendra Malik; Christoph Feichtenhofer

arXiv:2112.01526·cs.CV·March 31, 2022·29 cites

TL;DR

MViTv2 introduces an improved multiscale vision transformer architecture with decomposed relative positional embeddings and residual pooling, achieving state-of-the-art results across image classification, object detection, and video recognition tasks.

Contribution

The paper presents MViTv2, a unified multiscale vision transformer architecture. Two changes to MViT, decomposed relative positional embeddings and residual pooling connections, let it outperform prior models in image classification, object detection, and video recognition.

Findings

01. 88.8% accuracy on ImageNet classification

02. 58.7 boxAP on COCO detection

03. 86.1% accuracy on Kinetics-400 video recognition

Abstract

In this paper, we study Multiscale Vision Transformers (MViTv2) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTv2s' pooling attention to window attention mechanisms where it outperforms the latter in accuracy/compute. Without bells-and-whistles, MViTv2 has state-of-the-art performance in 3 domains: 88.8% accuracy on ImageNet classification, 58.7 boxAP on COCO object detection as well as 86.1% on Kinetics-400 video classification. Code and models are available at https://github.com/facebookresearch/mvit.
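The abstract names two mechanisms, decomposed relative positional embeddings and residual pooling connections, without spelling them out. The following is a minimal NumPy sketch of how they might fit together in a single-head pooling attention layer, under several assumptions that are not part of this page: average pooling stands in for the pooling operator, the same stride is applied to queries, keys, and values so that their grids match, and all function and parameter names (`avg_pool_tokens`, `decomposed_rel_pos_bias`, `pooling_attention`, `rel_h`, `rel_w`) are hypothetical. It is an illustration, not the authors' implementation, which is available at the repository linked in the abstract.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def avg_pool_tokens(x, h, w, stride):
    # Average-pool an (h*w, d) row-major token grid by `stride` along both axes.
    # Assumption: average pooling is only a stand-in pooling operator.
    d = x.shape[-1]
    grid = x.reshape(h // stride, stride, w // stride, stride, d)
    return grid.mean(axis=(1, 3)).reshape(-1, d)


def decomposed_rel_pos_bias(q, rel_h, rel_w, h, w):
    # Bias E[i, j] = q_i . (R_h[dy] + R_w[dx]): one embedding table per axis
    # instead of one table over all joint (dy, dx) offsets.
    # q: (h*w, d), rel_h: (2h-1, d), rel_w: (2w-1, d).
    d = q.shape[-1]
    idx_h = np.arange(h)[:, None] - np.arange(h)[None, :] + (h - 1)
    idx_w = np.arange(w)[:, None] - np.arange(w)[None, :] + (w - 1)
    Rh = rel_h[idx_h]  # (h, h, d), indexed by (query row, key row)
    Rw = rel_w[idx_w]  # (w, w, d), indexed by (query col, key col)
    qg = q.reshape(h, w, d)
    bias_h = np.einsum("ywd,ykd->ywk", qg, Rh)  # (h, w, key rows)
    bias_w = np.einsum("ywd,wkd->ywk", qg, Rw)  # (h, w, key cols)
    bias = bias_h[:, :, :, None] + bias_w[:, :, None, :]
    return bias.reshape(h * w, h * w)


def pooling_attention(x, Wq, Wk, Wv, rel_h, rel_w, h, w, stride=2):
    # Project, then pool queries, keys, and values to a coarser grid.
    q = avg_pool_tokens(x @ Wq, h, w, stride)
    k = avg_pool_tokens(x @ Wk, h, w, stride)
    v = avg_pool_tokens(x @ Wv, h, w, stride)
    ph, pw = h // stride, w // stride
    d = q.shape[-1]
    bias = decomposed_rel_pos_bias(q, rel_h, rel_w, ph, pw)
    attn = softmax((q @ k.T + bias) / np.sqrt(d))
    # Residual pooling connection: add the pooled query back to the output.
    return attn @ v + q
```

With an h-by-w grid, the decomposed tables hold (2h-1) + (2w-1) embeddings rather than (2h-1)(2w-1) for a joint table, which is the main reason to split the offsets by axis in this sketch.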

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multiscale Vision Transformer