Rethinking Query-Key Pairwise Interactions in Vision Transformers
Cheng Li, Yangxin Liu

TL;DR
This paper introduces key-only attention for Vision Transformers. The mechanism drops query-key pairwise interactions, brings computational and memory complexity down to linear in input size, and models local-global interactions in all stages, leading to state-of-the-art results on ImageNet, COCO, and ADE20K benchmarks.
Contribution
The paper proposes a key-only attention mechanism with linear complexity and a hybrid layout that alternates convolution and attention layers, improving efficiency and performance over existing methods.
Findings
Achieves state-of-the-art accuracy on ImageNet with parameter-limited models
Outperforms baselines in COCO object detection
Improves semantic segmentation results on ADE20K
Abstract
Vision Transformers have achieved state-of-the-art performance in many visual tasks. Due to the quadratic computational and memory complexities of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency gate to obtain attention weights, modeling local-global interactions in all stages. Key-only attention has linear computational and memory complexities with respect to input size. We use an alternating layout to hybridize convolution and attention layers, instead of the grafting suggested by previous works, so that all stages can benefit from both spatial attention and convolution. We leverage these improvements to develop a new self-attention model family, LinGlos,…
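The abstract does not spell out the exact form of the saliency gate, so the following is only a minimal NumPy sketch of the general idea under stated assumptions: each token's key is mapped to a single scalar score by a hypothetical projection vector `w_s`, the scores are normalized with a softmax over tokens, and the resulting weights pool the values into one global context vector. The function name, the parameter names, and the choice of a softmax are illustrative and may differ from the authors' formulation. What the sketch does show is why cost can be linear in the number of tokens n: no n-by-n query-key matrix is ever formed.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def key_only_attention(x, w_k, w_v, w_s):
    """Illustrative key-only attention (an assumption, not the paper's exact method).

    x:   (n, d) input tokens
    w_k: (d, d) key projection
    w_v: (d, d) value projection
    w_s: (d,)   hypothetical saliency projection, one scalar score per token
    """
    k = x @ w_k                        # (n, d) keys
    v = x @ w_v                        # (n, d) values
    scores = k @ w_s                   # (n,) scores from keys only, no queries
    weights = softmax(scores, axis=0)  # (n,) attention weights over tokens
    context = weights @ v              # (d,) global context vector
    return weights, context
```

Every step above touches each token a constant number of times, so time and memory grow as O(n d^2) rather than O(n^2 d). How the context vector is combined with local features is left open here because the excerpt does not describe it.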
Taxonomy
Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning
Methods: Convolution
