Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles
Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer

TL;DR
Hiera is a simple, fast, and accurate hierarchical vision transformer. Strong MAE pretraining lets it drop the vision-specific components that earlier hierarchical models added, without losing accuracy.
Contribution
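The paper introduces Hiera, a streamlined hierarchical vision transformer obtained by stripping the vision-specific components out of a state-of-the-art multi-stage model. With MAE pretraining, it is more accurate than previous models while being significantly faster at both inference and training.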
Findings
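Hiera is more accurate than previous hierarchical vision transformers on the image and video recognition tasks evaluated.
Hiera is significantly faster than those models both at inference and during training.
Removing the vision-specific components does not cost accuracy when the model is pretrained with MAE.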
Abstract
Modern hierarchical vision transformers have added several vision-specific components in the pursuit of supervised classification performance. While these components lead to effective accuracies and attractive FLOP counts, the added complexity actually makes these transformers slower than their vanilla ViT counterparts. In this paper, we argue that this additional bulk is unnecessary. By pretraining with a strong visual pretext task (MAE), we can strip out all the bells-and-whistles from a state-of-the-art multi-stage vision transformer without losing accuracy. In the process, we create Hiera, an extremely simple hierarchical vision transformer that is more accurate than previous models while being significantly faster both at inference and during training. We evaluate Hiera on a variety of tasks for image and video recognition. Our code and models are available at…
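The abstract credits MAE (masked autoencoder) pretraining for making the extra components unnecessary. As a rough illustration of the core idea only, the sketch below randomly hides most patches of an image so that a model would have to reconstruct them from the few that remain visible. This is a generic MAE-style masking sketch, not Hiera's actual procedure: the function name `random_patch_mask`, the 16x16 patch size, and the 0.75 mask ratio are all assumptions made for the example.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists, MAE-style.

    Hypothetical helper for illustration; not taken from the Hiera code.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)  # random order decides which patches are hidden
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setup: a 224x224 image cut into 16x16 patches gives a 14x14 grid.
visible, masked = random_patch_mask(num_patches=14 * 14, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))  # 49 147
```

In a pretraining loop of this kind, only the visible patches would be fed to the encoder, and the reconstruction loss would be computed on the masked ones.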
Code & Models
- 🤗 facebook/hiera_small_224.mae_in1k_ft_in1k (model · 12 dl · ♡ 1)
- 🤗 facebook/hiera_base_224.mae_in1k_ft_in1k (model · 56 dl · ♡ 3)
- 🤗 facebook/hiera-tiny-224-hf (model · 677 dl)
- 🤗 facebook/hiera-tiny-224-in1k-hf (model · 313 dl · ♡ 2)
- 🤗 facebook/hiera-tiny-224-mae-hf (model · 517 dl · ♡ 1)
- 🤗 facebook/hiera-small-224-mae-hf (model · 1 dl)
- 🤗 facebook/hiera-small-224-hf (model · 10 dl)
- 🤗 facebook/hiera-small-224-in1k-hf (model · 7 dl)
- 🤗 facebook/hiera-base-224-in1k-hf (model · 46 dl · ♡ 2)
- 🤗 facebook/hiera-base-224-hf (model · 114 dl)
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing · Advanced Neural Network Applications
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Residual Connection · Linear Layer · Layer Normalization · Softmax · Vision Transformer
