Learned Queries for Efficient Local Attention

Moab Arar; Ariel Shamir; Amit H. Bermano

arXiv:2112.11435 · cs.CV · April 20, 2022

TL;DR

This paper introduces QnA, a learned query local attention layer for vision transformers that improves speed and memory efficiency while maintaining accuracy, especially with larger window sizes.

Contribution

The paper proposes a novel shift-invariant local attention layer with learned queries, enhancing efficiency and scalability in hierarchical vision transformers.

Findings

1. QnA reduces memory usage by up to 10 times.
2. QnA is up to 5 times faster than existing methods.
3. QnA achieves comparable accuracy to state-of-the-art models.

Abstract

Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models locally employ self-attention on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits the cross-window interaction, hurting the model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like…
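The idea in the abstract (one set of learned queries shared across overlapping local windows) can be illustrated with a minimal JAX sketch. This is not the authors' implementation (the official code is the repository listed under Code & Models); it assumes a single head, a single learned query vector, stride 1, and zero padding at the borders. The names `extract_windows`, `init_params`, and `learned_query_local_attention` are hypothetical and chosen only for this illustration.

```python
import jax
import jax.numpy as jnp


def extract_windows(x, k):
    """Gather every overlapping k x k window (stride 1, zero padding).

    x: (H, W, C) feature map -> (H, W, k*k, C)
    """
    H, W, _ = x.shape
    p = k // 2
    xp = jnp.pad(x, ((p, p), (p, p), (0, 0)))
    # One shifted view of the padded map per offset inside the window.
    views = [xp[i:i + H, j:j + W, :] for i in range(k) for j in range(k)]
    return jnp.stack(views, axis=2)


def init_params(key, c_in, d):
    """Hypothetical parameter layout: key/value projections and one query."""
    k1, k2, k3 = jax.random.split(key, 3)
    scale = 1.0 / jnp.sqrt(c_in)
    return {
        "w_k": jax.random.normal(k1, (c_in, d)) * scale,
        "w_v": jax.random.normal(k2, (c_in, d)) * scale,
        "query": jax.random.normal(k3, (d,)),  # learned, not computed from x
    }


def learned_query_local_attention(x, params, k=3):
    """Attend within each k x k window using the same learned query."""
    keys = x @ params["w_k"]            # (H, W, D)
    values = x @ params["w_v"]          # (H, W, D)
    k_win = extract_windows(keys, k)    # (H, W, k*k, D)
    v_win = extract_windows(values, k)  # (H, W, k*k, D)
    q = params["query"]
    d = q.shape[0]
    # The query is identical for every window, so logits depend only on keys.
    logits = jnp.einsum("hwnd,d->hwn", k_win, q) / jnp.sqrt(d)
    attn = jax.nn.softmax(logits, axis=-1)  # weights over the k*k positions
    return jnp.einsum("hwn,hwnd->hwd", attn, v_win)


params = init_params(jax.random.PRNGKey(0), c_in=4, d=8)
x = jax.random.normal(jax.random.PRNGKey(1), (8, 8, 4))
out = learned_query_local_attention(x, params, k=3)
print(out.shape)  # (8, 8, 8)
```

Because the query does not depend on the window position, shifting the input shifts the output by the same amount away from the borders. In this simplified version the zero-padded positions still take part in the softmax, which a fuller implementation would likely mask out.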

Code & Models

Repositories

moabarar/qna (JAX, official)

Taxonomy

Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer · High-resolution input