Focal Self-attention for Local-Global Interactions in Vision Transformers

Jianwei Yang; Chunyuan Li; Pengchuan Zhang; Xiyang Dai; Bin Xiao; Lu Yuan; Jianfeng Gao

arXiv:2107.00641 · cs.CV · July 2, 2021 · 267 cites

TL;DR

This paper introduces Focal Self-attention, a mechanism that efficiently captures local and global visual dependencies in vision transformers, leading to state-of-the-art results in image classification, object detection, and segmentation.

Contribution

The paper proposes Focal Self-attention and Focal Transformer models, which improve efficiency and accuracy by combining fine local and coarse global interactions in vision transformers.

Findings

01 Achieves 83.5% and 83.8% top-1 accuracy on ImageNet with the moderate-size and large models, respectively.

02 Outperforms Swin Transformers on multiple object detection benchmarks.

03 Sets new state-of-the-art results on the COCO and ADE20K datasets.

Abstract

Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success. But it also brings challenges due to quadratic computational overhead, especially for the high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves…
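The fine-local, coarse-global idea described in the abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' implementation: the paper works on 2D feature maps with learned projections, whereas the version below assumes a 1D token sequence, a single coarse level built by average pooling, and keys that double as values. The function names (`avg_pool`, `focal_keys`, `focal_attend`) and the parameters `radius` and `pool_size` are hypothetical and chosen only for illustration.

```python
import math


def avg_pool(tokens, size):
    """Average non-overlapping groups of `size` tokens into coarse summaries."""
    pooled = []
    for start in range(0, len(tokens), size):
        group = tokens[start:start + size]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def focal_keys(tokens, i, radius, pool_size):
    """Fine tokens near position i plus pooled tokens covering the whole sequence."""
    lo, hi = max(0, i - radius), min(len(tokens), i + radius + 1)
    fine = tokens[lo:hi]                  # fine-grained local neighbourhood
    coarse = avg_pool(tokens, pool_size)  # coarse-grained global summary
    return fine + coarse


def focal_attend(tokens, i, radius=2, pool_size=4):
    """Scaled dot-product attention of token i over its focal key set.

    Assumption: keys are reused as values and no learned projections are applied.
    """
    query = tokens[i]
    keys = focal_keys(tokens, i, radius, pool_size)
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    return [sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(len(query))]


# Toy input: 64 two-dimensional tokens.
tokens = [[math.sin(0.1 * n), math.cos(0.1 * n)] for n in range(64)]
n_keys = len(focal_keys(tokens, 32, radius=2, pool_size=4))
out = focal_attend(tokens, 32)
print(n_keys, "keys attended instead of", len(tokens))
```

In this sketch the query at position 32 attends to 5 fine tokens and 16 pooled tokens rather than all 64, which is the source of the savings over full self-attention; the exact reduction in the paper depends on its window and pooling configuration, which this toy does not reproduce.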

Code & Models

Videos

Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers · YouTube

Taxonomy

Topics: Visual Attention and Saliency Detection · Visual perception and processing mechanisms · Infrared Target Detection Methodologies

Methods: Attention Is All You Need · Linear Layer · Focal Transformers · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Layer Normalization · Dropout · Multi-Head Attention