Learned Queries for Efficient Local Attention
Moab Arar, Ariel Shamir, Amit H. Bermano

TL;DR
This paper introduces QnA, a learned-query local attention layer for vision transformers that improves speed and memory efficiency while maintaining accuracy, especially with larger window sizes.
Contribution
The paper proposes a novel shift-invariant local attention layer with learned queries, enhancing efficiency and scalability in hierarchical vision transformers.
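To make the idea of a learned query concrete, here is a minimal NumPy sketch of local attention in which one query vector is shared by every overlapping window. It is an illustration under stated assumptions, not the authors' implementation: the function name, the single head and single query, the zero padding at the borders, the odd window size, and the projection matrices `w_k` and `w_v` are all hypothetical choices made for this example.

```python
import numpy as np

def learned_query_local_attention(x, q, w_k, w_v, window=3):
    """Hypothetical sketch of learned-query local attention.

    x:   (H, W, D) feature map
    q:   (D,) one learned query, shared by all windows (assumption)
    w_k: (D, D) key projection, w_v: (D, D) value projection
    window: odd window size; windows overlap with stride 1 (assumption)
    """
    h, w, d = x.shape
    pad = window // 2
    keys = x @ w_k                      # (H, W, D)
    values = x @ w_v                    # (H, W, D)
    # The query does not depend on the window, so each position's
    # query-key score is computed once and reused by every window.
    scores = keys @ q / np.sqrt(d)      # (H, W)
    # Padded positions get a score of -inf, hence zero attention weight.
    scores_p = np.pad(scores, pad, constant_values=-np.inf)
    values_p = np.pad(values, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros_like(values)
    for i in range(h):
        for j in range(w):
            s = scores_p[i:i + window, j:j + window].reshape(-1)
            v = values_p[i:i + window, j:j + window].reshape(-1, d)
            a = np.exp(s - s.max())     # softmax over the window
            a /= a.sum()
            out[i, j] = a @ v           # weighted sum of window values
    return out

# Usage with random weights (illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 6, 4))
q = rng.normal(size=4)
w_k = rng.normal(size=(4, 4))
w_v = rng.normal(size=(4, 4))
out = learned_query_local_attention(x, q, w_k, w_v, window=3)
```

Because the same query and projections are applied at every window position, shifting the input shifts the output by the same amount away from the borders, which is the sense in which such a layer behaves like a convolution.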
Findings
QnA reduces memory usage by up to 10 times.
QnA is up to 5 times faster than existing methods.
QnA achieves comparable accuracy to state-of-the-art models.
Abstract
Vision Transformers (ViT) serve as powerful vision models. Unlike convolutional neural networks, which dominated vision research in previous years, vision transformers enjoy the ability to capture long-range dependencies in the data. Nonetheless, an integral part of any transformer architecture, the self-attention mechanism, suffers from high latency and inefficient memory utilization, making it less suitable for high-resolution input images. To alleviate these shortcomings, hierarchical vision models locally employ self-attention on non-interleaving windows. This relaxation reduces the complexity to be linear in the input size; however, it limits the cross-window interaction, hurting the model performance. In this paper, we propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner, much like…
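The claim that windowed self-attention is linear in the input size can be checked with a simple count of attention scores. The sketch below is illustrative only: the function name is hypothetical, and it assumes square, non-overlapping windows that tile the feature map exactly.

```python
def attention_score_counts(height, width, window):
    """Count query-key scores for global vs. windowed self-attention.

    Assumes non-overlapping window x window tiles that divide the
    feature map evenly (an assumption made for this illustration).
    """
    n = height * width
    # Global self-attention: every token attends to every token.
    global_scores = n * n
    # Windowed self-attention: n / window^2 windows, each holding
    # (window^2)^2 scores, which simplifies to n * window^2.
    num_windows = (height // window) * (width // window)
    windowed_scores = num_windows * (window * window) ** 2
    return global_scores, windowed_scores

small = attention_score_counts(56, 56, 7)
large = attention_score_counts(112, 112, 7)
```

Doubling the resolution in each dimension quadruples the token count, so the windowed count grows by a factor of 4 while the global count grows by a factor of 16.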
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings · Softmax · Layer Normalization · Residual Connection · Dense Connections · Vision Transformer · High-resolution input
