Multiscale Vision Transformers
Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, Christoph Feichtenhofer

TL;DR
Multiscale Vision Transformers (MViT) introduce a hierarchical multiscale architecture for video and image recognition, outperforming existing transformers in accuracy and efficiency by leveraging multiscale feature hierarchies.
Contribution
The paper proposes a novel multiscale transformer architecture that hierarchically expands channel capacity while reducing spatial resolution, improving recognition performance without large external pre-training.
Findings
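MViT outperforms concurrent vision transformers on video recognition tasks, even though those models rely on large-scale external pre-training.
MViT also achieves higher accuracy than prior vision transformers on image classification benchmarks.
It requires significantly less computation and fewer parameters; the abstract describes the concurrent models as 5-10x more costly.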
Abstract
We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse, but complex, high-dimensional features. We evaluate this fundamental architectural prior for modeling the dense nature of visual signals for a variety of video recognition tasks where it outperforms concurrent vision transformers that rely on large scale external pre-training and are 5-10x more costly in…
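The channel-resolution schedule described in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' implementation: the helper name `stage_schedule`, the 56x56 starting token grid, the base width of 96 channels, and the per-stage factors (channels doubled, each spatial side halved) are all assumptions chosen to make the arithmetic concrete.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stage:
    """One channel-resolution scale stage (hypothetical container)."""
    index: int
    channels: int
    height: int
    width: int

    @property
    def tokens(self) -> int:
        # Number of spatial positions the stage operates on.
        return self.height * self.width


def stage_schedule(height: int, width: int, base_channels: int,
                   num_stages: int, channel_mult: int = 2,
                   stride: int = 2) -> List[Stage]:
    """Return the (channels, resolution) pair for each stage.

    Assumption: every stage transition multiplies the channel count by
    `channel_mult` and divides each spatial side by `stride`.
    """
    stages = []
    c, h, w = base_channels, height, width
    for i in range(num_stages):
        stages.append(Stage(i + 1, c, h, w))
        c *= channel_mult   # expand channel capacity
        h //= stride        # reduce spatial resolution
        w //= stride
    return stages


if __name__ == "__main__":
    for s in stage_schedule(56, 56, base_channels=96, num_stages=4):
        print(f"stage {s.index}: {s.channels} channels, "
              f"{s.height}x{s.width} grid, {s.tokens} tokens")
```

Under these assumed factors, each transition cuts the number of spatial positions by 4x while doubling the channel width, so early stages work on many narrow tokens and later stages on few wide ones, which is the feature pyramid the abstract describes.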
Code & Models
- 🤗 facebook/hiera_small_224.mae_in1k_ft_in1k (model): 12 downloads, 1 like
- 🤗 facebook/hiera_base_224.mae_in1k_ft_in1k (model): 56 downloads, 3 likes
- 🤗 facebook/hiera-tiny-224-hf (model): 677 downloads
- 🤗 facebook/hiera-tiny-224-in1k-hf (model): 313 downloads, 2 likes
- 🤗 facebook/hiera-tiny-224-mae-hf (model): 517 downloads, 1 like
- 🤗 facebook/hiera-small-224-mae-hf (model): 1 download
- 🤗 facebook/hiera-small-224-hf (model): 10 downloads
- 🤗 facebook/hiera-small-224-in1k-hf (model): 7 downloads
- 🤗 facebook/hiera-base-224-in1k-hf (model): 46 downloads, 2 likes
- 🤗 facebook/hiera-base-224-hf (model): 114 downloads
Taxonomy
Topics: Advanced Neural Network Applications · Cell Image Analysis Techniques · Advanced Image and Video Retrieval Techniques
Methods: Multiscale Vision Transformer
