Rethinking Query-Key Pairwise Interactions in Vision Transformers

Cheng Li; Yangxin Liu

arXiv:2207.00188 · cs.CV · July 5, 2022

TL;DR

This paper introduces key-only attention in Vision Transformers, reducing computational complexity and enhancing local-global interaction modeling, leading to state-of-the-art results on ImageNet, COCO, and ADE20K benchmarks.

Contribution

It proposes a novel key-only attention mechanism with linear complexity and a hybrid convolution-attention layout, improving efficiency and performance over existing methods.

Findings

01 Achieves state-of-the-art accuracy on ImageNet with parameter-limited models

02 Outperforms baselines in COCO object detection

03 Improves semantic segmentation results on ADE20K

Abstract

Vision Transformers have achieved state-of-the-art performance in many visual tasks. Due to the quadratic computational and memory complexities of self-attention, recent works either apply attention only to low-resolution inputs or restrict the receptive field to a small local region. To overcome these limitations, we propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency gate to obtain attention weights, modeling local-global interactions in all stages. Key-only attention has linear computational and memory complexities with respect to input size. We use an alternating layout to hybridize convolution and attention layers, instead of the grafting suggested by previous works, so that all stages can benefit from both spatial attention and convolution. We leverage these improvements to develop a new self-attention model family, LinGlos,…
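
To make the linear-cost claim concrete, here is a minimal NumPy sketch of attention without query-key pairwise interactions. It is an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the saliency gate, so the gate here is assumed to be a single learned vector `w_gate` that maps each key to one scalar score, and the final step that adds the shared global context back to each token's value is likewise an assumption. The names `key_only_attention`, `w_k`, `w_v`, and `w_gate` are hypothetical.

```python
import numpy as np


def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def key_only_attention(x, w_k, w_v, w_gate):
    """Sketch of attention without query-key pairwise interactions.

    x:      (n, d) input tokens
    w_k:    (d, d) key projection (hypothetical parameter)
    w_v:    (d, d) value projection (hypothetical parameter)
    w_gate: (d,)   assumed saliency gate: one scalar score per key

    Returns the (n, d) output and the (n,) attention weights.
    """
    k = x @ w_k                 # (n, d) keys
    v = x @ w_v                 # (n, d) values
    scores = k @ w_gate         # (n,) one saliency score per token, no n x n matrix
    weights = softmax(scores)   # (n,) attention weights over all tokens
    context = weights @ v       # (d,) global context vector
    # Assumption: each token combines its own value with the shared context.
    out = v + context           # broadcasts (d,) across the n rows
    return out, weights


rng = np.random.default_rng(0)
n, d = 16, 8
x = rng.normal(size=(n, d))
w_k = rng.normal(size=(d, d))
w_v = rng.normal(size=(d, d))
w_gate = rng.normal(size=d)

out, weights = key_only_attention(x, w_k, w_v, w_gate)
print(out.shape, weights.shape)
```

Because the scores form a length-n vector rather than an n-by-n matrix, both compute and memory in this sketch grow linearly with the number of tokens, which is the property the abstract attributes to key-only attention.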

Taxonomy

Topics: Advanced Neural Network Applications · Visual Attention and Saliency Detection · Domain Adaptation and Few-Shot Learning

Methods: Convolution