Focal Self-attention for Local-Global Interactions in Vision Transformers
Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Xiyang Dai, Bin Xiao, Lu Yuan, Jianfeng Gao

TL;DR
This paper introduces Focal Self-attention, a mechanism that efficiently captures local and global visual dependencies in vision transformers, leading to state-of-the-art results in image classification, object detection, and segmentation.
Contribution
The paper proposes Focal Self-attention and Focal Transformer models, which improve efficiency and accuracy by combining fine local and coarse global interactions in vision transformers.
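To give a rough sense of where the efficiency gain could come from, the sketch below counts how many keys a single query token attends to under full self-attention versus a simplified two-level scheme (one fine local window plus one average-pooled coarse view of the whole feature map). The function names, the window size, and the pooling stride are illustrative assumptions, not values taken from the paper.

```python
import math


def keys_full(height, width):
    """Keys per query under full self-attention: every token on the map."""
    return height * width


def keys_two_level(height, width, window, stride):
    """Keys per query under an assumed two-level scheme.

    fine:   a window x window patch of tokens around the query
    coarse: the whole map, average-pooled with the given stride
    Both `window` and `stride` are hypothetical parameters for illustration.
    """
    fine = min(window, height) * min(window, width)
    coarse = math.ceil(height / stride) * math.ceil(width / stride)
    return fine + coarse


if __name__ == "__main__":
    for side in (14, 28, 56):
        full = keys_full(side, side)
        focal = keys_two_level(side, side, window=7, stride=7)
        print(f"{side}x{side} map: full={full} keys, two-level={focal} keys per query")
```

Under these assumptions the coarse term still grows with the feature map, but it is reduced by a factor of roughly the stride squared, while the fine term stays constant.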
Findings
Achieves 83.5% and 83.8% top-1 accuracy on ImageNet with the moderate and large models, respectively.
Outperforms Swin Transformers on multiple object detection benchmarks.
Sets new state-of-the-art results on COCO and ADE20K datasets.
Abstract
Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success. But it also brings challenges due to quadratic computational overhead, especially for the high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves…
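The idea described in the abstract, attending to nearby tokens at fine granularity and to distant tokens at coarse granularity, can be illustrated with a minimal per-token sketch. This is not the authors' implementation: it assumes a single query token, a single coarse level built by average pooling, no learned query/key/value projections (keys double as values), and hypothetical parameter names `radius` and `stride`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def avg_pool(grid, stride):
    """Average non-overlapping stride x stride blocks of a 2D grid of vectors."""
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, height, stride):
        for j in range(0, width, stride):
            block = [grid[a][b]
                     for a in range(i, min(i + stride, height))
                     for b in range(j, min(j + stride, width))]
            pooled.append([sum(v[k] for v in block) / len(block)
                           for k in range(dim)])
    return pooled


def focal_attend(grid, qi, qj, radius=1, stride=2):
    """Attend from token (qi, qj) to fine local keys plus coarse pooled keys.

    Returns the attended vector and the number of fine and coarse keys used.
    """
    height, width, dim = len(grid), len(grid[0]), len(grid[0][0])
    query = grid[qi][qj]
    # Fine level: the unpooled tokens in a small neighbourhood of the query.
    fine = [grid[a][b]
            for a in range(max(0, qi - radius), min(height, qi + radius + 1))
            for b in range(max(0, qj - radius), min(width, qj + radius + 1))]
    # Coarse level: the whole map summarised by average pooling.
    coarse = avg_pool(grid, stride)
    keys = fine + coarse
    scores = [sum(x * y for x, y in zip(query, key)) / math.sqrt(dim)
              for key in keys]
    weights = softmax(scores)
    # Keys are reused as values because this sketch has no learned projections.
    output = [sum(w * key[c] for w, key in zip(weights, keys))
              for c in range(dim)]
    return output, len(fine), len(coarse)


if __name__ == "__main__":
    # Toy 8x8 feature map with 2-dimensional token embeddings.
    grid = [[[float(i), float(j)] for j in range(8)] for i in range(8)]
    out, n_fine, n_coarse = focal_attend(grid, 4, 4)
    print(n_fine + n_coarse, "keys instead of", 8 * 8)
```

In this toy setting the query sees 9 fine keys and 16 coarse keys rather than all 64 tokens. The paper's models are described as operating with this fine-plus-coarse pattern inside a full transformer; the sketch only shows the attention pattern for one token.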
Videos
Focal Transformer: Focal Self-attention for Local-Global Interactions in Vision Transformers (YouTube)
Taxonomy
Topics: Visual Attention and Saliency Detection · Visual perception and processing mechanisms · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Linear Layer · Focal Transformers · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Byte Pair Encoding · Layer Normalization · Dropout · Multi-Head Attention
