Adaptive Multi-Resolution Attention with Linear Complexity

Yao Zhang; Yunpu Ma; Thomas Seidl; Volker Tresp

arXiv:2108.04962·cs.LG·August 12, 2021

Open Access

TL;DR

The paper introduces AdaMRA (Adaptive Multi-Resolution Attention), a multi-resolution attention mechanism for Transformers that scales linearly with sequence length in time and memory, captures long-range context in a coarse-to-fine fashion, and is reported to reach state-of-the-art results.

Contribution

It proposes a novel multi-resolution attention structure with query-driven resolution selection and kernel attention for linear complexity in Transformers.
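
The linear-complexity claim can be illustrated with a minimal kernel-attention sketch. This is not the authors' implementation: the feature map `elu(x) + 1` and the function names `feature_map`, `kernel_attention`, and `quadratic_reference` are assumptions chosen for illustration. The idea is that replacing the softmax with a product of feature maps lets the key-value summary be computed once, so the n-by-m attention matrix is never formed.

```python
import numpy as np


def feature_map(x):
    # elu(x) + 1, a strictly positive feature map.
    # Assumption for illustration; the paper's kernel may differ.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))


def kernel_attention(Q, K, V):
    """Linear-time attention. Q: (n, d), K: (m, d), V: (m, d_v) -> (n, d_v)."""
    Qf, Kf = feature_map(Q), feature_map(K)
    kv = Kf.T @ V          # (d, d_v) summary of keys and values, built once
    z = Kf.sum(axis=0)     # (d,) normaliser summary
    return (Qf @ kv) / (Qf @ z)[:, None]


def quadratic_reference(Q, K, V):
    """Same result computed the quadratic way, by forming the (n, m) matrix."""
    A = feature_map(Q) @ feature_map(K).T
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V


rng = np.random.default_rng(0)
Q = rng.normal(size=(6, 4))
K = rng.normal(size=(9, 4))
V = rng.normal(size=(9, 3))
out = kernel_attention(Q, K, V)
```

Both functions compute the same output; only the order of the matrix products changes, which is what moves the cost from O(n·m·d) to O((n + m)·d·d_v).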

Findings

01. Achieves state-of-the-art performance on multiple benchmarks.

02. Demonstrates significant efficiency and memory improvements.

03. Maintains performance with reduced computational complexity.

Abstract

Transformers have improved the state-of-the-art across numerous tasks in sequence modeling. Besides the quadratic computational and memory complexity with respect to the sequence length, the self-attention mechanism only processes information at the same scale, i.e., all attention heads operate at the same resolution, which limits the power of the Transformer. To remedy this, we propose a novel and efficient structure named Adaptive Multi-Resolution Attention (AdaMRA for short), which scales linearly with sequence length in terms of time and space. Specifically, we leverage a multi-resolution multi-head attention mechanism, enabling attention heads to capture long-range contextual information in a coarse-to-fine fashion. Moreover, to capture the potential relations between query representation and clues of different attention granularities, we leave the decision of which resolution of attention…
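
The coarse-to-fine idea can be sketched as follows, under stated assumptions. The abstract as quoted here does not say how the different resolutions are produced, so this example assumes non-overlapping average pooling of keys and values, with one compression rate per head. The names `avg_pool_sequence` and `multi_resolution_memory` and the rates `(1, 2, 4)` are hypothetical, and the query-driven choice of resolution is deliberately left out.

```python
import numpy as np


def avg_pool_sequence(X, rate):
    """Compress (n, d) to (ceil(n / rate), d) by averaging windows of `rate` rows."""
    n, d = X.shape
    pad = (-n) % rate
    if pad:
        # Repeat the last row so the final window is full (illustrative choice).
        X = np.vstack([X, np.repeat(X[-1:], pad, axis=0)])
    return X.reshape(-1, rate, d).mean(axis=1)


def multi_resolution_memory(K, V, rates=(1, 2, 4)):
    """Return one (keys, values) pair per rate, ordered from fine to coarse."""
    return [(avg_pool_sequence(K, r), avg_pool_sequence(V, r)) for r in rates]


rng = np.random.default_rng(0)
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 4))
memories = multi_resolution_memory(K, V)
lengths = [k.shape[0] for k, _ in memories]
```

A head attending to the rate-4 memory sees a shorter, coarser summary of the sequence, while a rate-1 head sees every position.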

Taxonomy

Topics: Topic Modeling · Scientific Computing and Data Management · Machine Learning and Data Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Dense Connections · Layer Normalization · Byte Pair Encoding · Label Smoothing · Residual Connection