Adaptive Multi-Resolution Attention with Linear Complexity
Yao Zhang, Yunpu Ma, Thomas Seidl, Volker Tresp

TL;DR
The paper introduces AdaMRA, an efficient multi-resolution attention mechanism for Transformers that achieves linear complexity and improves long-range information capture, leading to state-of-the-art results.
Contribution
It proposes a novel multi-resolution attention structure with query-driven resolution selection and kernel attention for linear complexity in Transformers.
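To illustrate why kernel attention gives linear complexity, here is a minimal NumPy sketch. It is not the authors' implementation: the `elu(x) + 1` feature map and the function names `feature_map`, `linear_attention`, and `quadratic_reference` are assumptions chosen for illustration, since this summary does not state which kernel AdaMRA uses. The idea is to replace softmax(QK^T)V with phi(Q)(phi(K)^T V), so the n-by-m attention matrix is never formed.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed kernel, not taken from the paper).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_attention(q, k, v):
    """Kernel attention in O(n * d * dv) time; q: (n, d), k: (m, d), v: (m, dv)."""
    phi_q = feature_map(q)
    phi_k = feature_map(k)
    kv = phi_k.T @ v           # (d, dv) summary of all keys and values
    z = phi_k.sum(axis=0)      # (d,) normaliser summary
    num = phi_q @ kv           # (n, dv)
    den = phi_q @ z            # (n,)
    return num / den[:, None]

def quadratic_reference(q, k, v):
    """Same result computed through the explicit (n, m) attention matrix."""
    scores = feature_map(q) @ feature_map(k).T
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((10, 4))
v = rng.standard_normal((10, 3))
out = linear_attention(q, k, v)
```

Both functions return the same values; only the order of the matrix products differs, which is what removes the quadratic term in the sequence length.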
Findings
Achieves state-of-the-art performance on multiple benchmarks.
Demonstrates significant efficiency and memory improvements.
Maintains performance with reduced computational complexity.
Abstract
Transformers have improved the state of the art across numerous tasks in sequence modeling. Besides the quadratic computational and memory complexity w.r.t. the sequence length, the self-attention mechanism only processes information at the same scale, i.e., all attention heads are in the same resolution, which limits the power of the Transformer. To remedy this, we propose a novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA for short), which scales linearly to sequence length in terms of time and space. Specifically, we leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion. Moreover, to capture the potential relations between query representation and clues of different attention granularities, we leave the decision of which resolution of attention…
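The coarse-to-fine idea can be sketched as follows, under loudly stated assumptions. The abstract is truncated here, so the routing rule below (each query picks one resolution via the argmax of a hypothetical linear router `router_w`) and the use of average pooling to build coarser keys and values are illustrative guesses, not the paper's method. For readability the sketch uses ordinary softmax attention over the pooled sequence rather than kernel attention.

```python
import numpy as np

def pool(x, rate):
    # Average consecutive windows of `rate` tokens; the last window may be shorter.
    return np.stack([x[i:i + rate].mean(axis=0) for i in range(0, x.shape[0], rate)])

def softmax_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ v

def multi_resolution_attention(q, k, v, rates, router_w):
    """q: (n, d), k: (m, d), v: (m, dv), router_w: (d, len(rates)) (hypothetical)."""
    choice = np.argmax(q @ router_w, axis=1)   # resolution index chosen per query
    out = np.zeros((q.shape[0], v.shape[1]))
    for idx, rate in enumerate(rates):
        mask = choice == idx
        if mask.any():
            # Queries routed to this resolution attend over keys/values pooled by `rate`.
            out[mask] = softmax_attention(q[mask], pool(k, rate), pool(v, rate))
    return out, choice

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 4))
k = rng.standard_normal((8, 4))
v = rng.standard_normal((8, 3))
router_w = rng.standard_normal((4, 3))
out, choice = multi_resolution_attention(q, k, v, rates=[1, 2, 4], router_w=router_w)
```

A rate of 1 leaves keys and values untouched (the finest resolution), while larger rates shorten the sequence each query attends over, trading detail for a cheaper, wider view.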
Taxonomy
Topics: Topic Modeling · Scientific Computing and Data Management · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Residual Connection
