RegionViT: Regional-to-Local Attention for Vision Transformers

Chun-Fu Chen; Rameswar Panda; Quanfu Fan

arXiv:2106.02689 · cs.CV · April 1, 2022 · 94 cites

TL;DR

RegionViT combines a pyramid structure with a regional-to-local attention mechanism so that a vision transformer captures both global and local information. It matches or outperforms state-of-the-art ViT variants across classification, detection, segmentation, and recognition tasks.

Contribution

The paper proposes a regional-to-local attention mechanism for a pyramid-structured vision transformer. Regional tokens exchange global information among themselves, and each regional token is associated by spatial location with a set of local tokens, replacing global self-attention.

Findings

01. Outperforms or matches state-of-the-art ViT variants on multiple tasks

02. Effective regional-to-local attention captures global and local information

03. Demonstrates versatility across classification, detection, segmentation, and recognition

Abstract

Vision transformer (ViT) has recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens and then the…
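The token association and the two attention steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`self_attention`, `group_local_tokens`, `regional_to_local_attention`) and the `window` parameter are hypothetical. Attention here is single-head with identity projections, and there are no learned weights, residual connections, or normalization layers. Because the abstract is cut off before it describes the second step, the sketch assumes that step lets each regional token attend jointly with its own associated local tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head attention with identity Q/K/V projections (illustrative only).
    # tokens: (..., n, d) -> (..., n, d)
    d = tokens.shape[-1]
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

def group_local_tokens(local, window):
    # Associate local tokens with regions by spatial location.
    # local: (H, W, D) -> (num_regions, window * window, D)
    H, W, D = local.shape
    gh, gw = H // window, W // window
    x = local.reshape(gh, window, gw, window, D).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, window * window, D)

def regional_to_local_attention(regional, local, window):
    # regional: (gh, gw, D); local: (gh * window, gw * window, D)
    gh, gw, D = regional.shape
    # Step 1: self-attention among all regional tokens (global information).
    reg = self_attention(regional.reshape(gh * gw, D))
    # Step 2 (ASSUMPTION, not stated in the truncated abstract): each regional
    # token attends together with the local tokens of its own region only.
    loc = group_local_tokens(local, window)
    joint = self_attention(np.concatenate([reg[:, None, :], loc], axis=1))
    new_reg = joint[:, 0, :].reshape(gh, gw, D)
    # Undo the grouping so local tokens return to their spatial layout.
    new_loc = joint[:, 1:, :].reshape(gh, gw, window, window, D)
    new_loc = new_loc.transpose(0, 2, 1, 3, 4).reshape(gh * window, gw * window, D)
    return new_reg, new_loc

# Usage with made-up sizes: a 2x2 grid of regional tokens, each covering 7x7 local tokens.
rng = np.random.default_rng(0)
regional = rng.normal(size=(2, 2, 8))
local = rng.normal(size=(14, 14, 8))
new_regional, new_local = regional_to_local_attention(regional, local, window=7)
print(new_regional.shape, new_local.shape)  # (2, 2, 8) (14, 14, 8)
```

In this sketch, local tokens never attend directly to local tokens in other regions. Information from elsewhere in the image reaches them only through their regional token, which keeps each attention window small compared with global self-attention over all local tokens.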

Code & Models

Videos

RegionViT: Regional-to-Local Attention for Vision Transformers · slideslive

Taxonomy

Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection

Methods: RegionViT