RegionViT: Regional-to-Local Attention for Vision Transformers
Chun-Fu Chen, Rameswar Panda, Quanfu Fan

TL;DR
RegionViT introduces a regional-to-local attention mechanism within a pyramid structure for vision transformers, effectively capturing both global and local information, leading to improved performance across multiple vision tasks.
Contribution
The paper proposes a novel regional-to-local attention mechanism in a pyramid vision transformer architecture, enhancing global and local feature integration for better vision task performance.
Findings
Outperforms or matches state-of-the-art ViT variants on multiple tasks
Effective regional-to-local attention captures global and local information
Demonstrates versatility across image classification, object and keypoint detection, semantic segmentation, and action recognition
Abstract
Vision transformer (ViT) has recently shown a strong capability to achieve results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT directly inherits its architecture from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts the pyramid structure and employs a novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, the regional self-attention extracts global information among all regional tokens and then the…
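The token grouping and the two attention steps can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the 4x4 token grid, the window ratio of 2, and the use of identity query/key/value projections are all assumptions made for illustration. The abstract above is cut off before it describes the second step, so the sketch assumes that step is a self-attention restricted to each regional token together with its associated local tokens.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    A real model would use learned projections; they are omitted here
    (assumption) so the sketch needs no trained weights.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def group_local_tokens(grid_h, grid_w, ratio):
    """Map each regional token to the local-token indices in its window.

    Local tokens lie on a grid_h x grid_w grid (row-major indices); each
    regional token covers a ratio x ratio window of that grid.
    """
    groups = []
    for ry in range(grid_h // ratio):
        for rx in range(grid_w // ratio):
            groups.append([
                (ry * ratio + dy) * grid_w + (rx * ratio + dx)
                for dy in range(ratio)
                for dx in range(ratio)
            ])
    return groups


def regional_to_local_attention(regional, local, groups):
    """Hypothetical two-step attention over regional and local tokens."""
    # Step 1 (stated in the abstract): self-attention among all regional tokens.
    regional = self_attention(regional)
    # Step 2 (ASSUMPTION, the abstract is truncated here): within each region,
    # the regional token and its associated local tokens attend to each other.
    new_regional = []
    new_local = [None] * len(local)
    for region_token, indices in zip(regional, groups):
        mixed = self_attention([region_token] + [local[i] for i in indices])
        new_regional.append(mixed[0])
        for i, token in zip(indices, mixed[1:]):
            new_local[i] = token
    return new_regional, new_local


if __name__ == "__main__":
    random.seed(0)
    dim = 8
    groups = group_local_tokens(grid_h=4, grid_w=4, ratio=2)
    regional = [[random.gauss(0, 1) for _ in range(dim)] for _ in groups]
    local = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]
    new_regional, new_local = regional_to_local_attention(regional, local, groups)
    print(len(new_regional), len(new_local))
```

In this sketch a local token only ever attends within its own window, so information from other windows can reach it only through the regional token that was updated in the first step.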
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Visual Attention and Saliency Detection
Methods: RegionViT
