VOLO: Vision Outlooker for Visual Recognition

Li Yuan; Qibin Hou; Zihang Jiang; Jiashi Feng; Shuicheng Yan

arXiv:2106.13112 · cs.CV · June 29, 2021 · 23 cites

TL;DR

VOLO introduces outlook attention to improve fine-level feature encoding in vision transformers, achieving state-of-the-art accuracy on ImageNet and strong transfer performance without extra data.

Contribution

The paper proposes outlook attention, a novel attention mechanism, and VOLO, a simple architecture that uses it to improve fine-level feature encoding in vision transformers and to outperform previous models on ImageNet without extra training data.

Findings

01  Achieves 87.1% top-1 accuracy on ImageNet-1K.

02  Outperforms previous models without extra training data.

03  Transfers effectively to semantic segmentation tasks.

Abstract

Visual recognition has been dominated by convolutional neural networks (CNNs) for years. Though recently the prevailing vision transformers (ViTs) have shown great potential of self-attention based models in ImageNet classification, their performance is still inferior to that of the latest SOTA CNNs if no extra data are provided. In this work, we try to close the performance gap and demonstrate that attention-based models are indeed able to outperform CNNs. We find a major factor limiting the performance of ViTs for ImageNet classification is their low efficacy in encoding fine-level features into the token representations. To resolve this, we introduce a novel outlook attention and present a simple and general architecture, termed Vision Outlooker (VOLO). Unlike self-attention that focuses on global dependency modeling at a coarse level, the outlook attention efficiently encodes…
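
The abstract is cut off before it describes how outlook attention works, so the following is only a minimal sketch of the general idea as it is commonly summarized: each spatial location predicts, with a plain linear layer, the attention weights for its surrounding K x K window, and the weighted window values are summed back into the feature map. The function name `outlook_attention`, the weight matrices `w_v` and `w_a`, the window size `k=3`, stride 1, zero padding, and the single-head form without scaling or an output projection are all assumptions made for illustration; this is not the authors' implementation.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def outlook_attention(x, w_v, w_a, k=3):
    """Single-head, stride-1 sketch of outlook attention (illustrative only).

    x   : (h, w, c) feature map
    w_v : (c, c) value projection (hypothetical parameter name)
    w_a : (c, k**4) projection that predicts window attention weights
    """
    h, w, c = x.shape
    pad = k // 2
    v = x @ w_v                                   # values, (h, w, c)
    # Each location predicts a (k*k, k*k) attention matrix for its window.
    a = softmax((x @ w_a).reshape(h, w, k * k, k * k), axis=-1)
    v_pad = np.pad(v, ((pad, pad), (pad, pad), (0, 0)))
    out_pad = np.zeros_like(v_pad)
    for i in range(h):
        for j in range(w):
            # "Unfold": gather the k x k window of values centred at (i, j).
            window = v_pad[i:i + k, j:j + k].reshape(k * k, c)
            mixed = a[i, j] @ window              # (k*k, c)
            # "Fold": add the re-weighted window back at its positions.
            out_pad[i:i + k, j:j + k] += mixed.reshape(k, k, c)
    # Drop the padding border to return to the input resolution.
    return out_pad[pad:pad + h, pad:pad + w]

# Usage with random inputs; only the output shape is checked here.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 16))
w_v = rng.standard_normal((16, 16))
w_a = rng.standard_normal((16, 3 ** 4))
print(outlook_attention(x, w_v, w_a).shape)       # (8, 8, 16)
```

Note that the attention weights in this sketch come directly from a linear projection of the centre feature rather than from query-key dot products, which is what keeps the cost local to each window. The explicit Python loops are for readability only; a practical version would use vectorized unfold and fold operations.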

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques