VOLO: Vision Outlooker for Visual Recognition
Li Yuan, Qibin Hou, Zihang Jiang, Jiashi Feng, Shuicheng Yan

TL;DR
VOLO introduces outlook attention to improve fine-level feature encoding in vision transformers, achieving state-of-the-art accuracy on ImageNet and strong transfer performance without extra data.
Contribution
The paper proposes a novel outlook attention mechanism and a simple architecture, VOLO, that improves fine-level feature encoding in vision transformers and surpasses previous models on ImageNet-1K without extra training data.
Findings
- Achieves 87.1% top-1 accuracy on ImageNet-1K.
- Outperforms previous models without extra training data.
- Transfers effectively to semantic segmentation tasks.
Abstract
Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs if no extra data are provided. In this work, we try to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We find a major factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes…
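The abstract above is cut off before it describes the mechanism, so the following is only a minimal NumPy sketch of outlook attention as it is commonly described: each location predicts, with a linear layer, the attention weights for its K x K neighborhood, applies them to the unfolded values, and the results are folded (summed) back onto the grid. It assumes a single head, stride 1, zero padding, no output projection, and a 1/sqrt(C) scale. The names `outlook_attention`, `w_v`, and `w_a` are hypothetical and are not taken from the official repository.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def outlook_attention(x, w_v, w_a, k=3):
    """Single-head outlook attention sketch (assumptions: stride 1, zero padding).

    x:   (H, W, C) token grid
    w_v: (C, C) value projection (hypothetical name)
    w_a: (C, k**4) projection producing the attention weights (hypothetical name)
    """
    h, w, c = x.shape
    pad = k // 2

    # Project values, then unfold: windows[i, j] holds the k*k neighbours of (i, j).
    v = x @ w_v
    v_pad = np.pad(v, ((pad, pad), (pad, pad), (0, 0)))
    windows = np.empty((h, w, k * k, c))
    for di in range(k):
        for dj in range(k):
            windows[:, :, di * k + dj, :] = v_pad[di:di + h, dj:dj + w, :]

    # Attention weights come directly from the centre token, not from
    # query-key dot products; softmax normalises over the window.
    attn = (x @ w_a).reshape(h, w, k * k, k * k)
    attn = softmax(attn / np.sqrt(c), axis=-1)

    # Weighted mix of the values inside every window: (H, W, k*k, C).
    mixed = attn @ windows

    # Fold: sum each window's outputs back onto the positions they cover.
    out_pad = np.zeros((h + 2 * pad, w + 2 * pad, c))
    for di in range(k):
        for dj in range(k):
            out_pad[di:di + h, dj:dj + w, :] += mixed[:, :, di * k + dj, :]
    return out_pad[pad:pad + h, pad:pad + w, :]


# Usage: random 8x8 grid with 16 channels and a 3x3 window.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 8, 16))
y = outlook_attention(tokens,
                      rng.standard_normal((16, 16)),
                      rng.standard_normal((16, 3 ** 4)))
print(y.shape)
```

Because the weights are predicted from a single token rather than computed from pairwise similarities, the cost of this step grows with the window size K rather than with the total number of tokens.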
Code & Models
- rwightman/pytorch-image-models · pytorch · Official
- sail-sg/volo · pytorch · Official
- xmu-xiaoma666/External-Attention-pytorch/blob/master/attention/OutlookAttention.py · pytorch
- leondgarse/keras_cv_attention_models/tree/main/keras_cv_attention_models/volo · tf
- mindspore-courses/External-Attention-MindSpore/blob/main/model/backbone/VOLO.py · mindspore
- 🤗 kadirnar/timm_model_list · model · ♡ 1
- 🤗 timm/volo_d1_224.sail_in1k · model · 3.1k dl · ♡ 2
- 🤗 timm/volo_d1_384.sail_in1k · model · 35 dl
- 🤗 timm/volo_d2_224.sail_in1k · model · 51 dl
- 🤗 timm/volo_d2_384.sail_in1k · model · 79 dl
- 🤗 timm/volo_d3_224.sail_in1k · model · 43 dl
- 🤗 timm/volo_d3_448.sail_in1k · model · 39 dl
- 🤗 timm/volo_d4_224.sail_in1k · model · 40 dl
- 🤗 timm/volo_d4_448.sail_in1k · model · 40 dl
- 🤗 timm/volo_d5_224.sail_in1k · model · 130 dl
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
