H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences

Zhenhai Zhu; Radu Soricut

arXiv:2107.11906 · cs.LG · July 27, 2021 · 1 citation

TL;DR

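H-Transformer-1D introduces a hierarchical attention mechanism modeled on the Hierarchical Matrix (H-Matrix) from numerical analysis. It runs in linear time and memory, beats alternative sub-quadratic methods by more than 6 points on average on the Long Range Arena benchmark, and sets a new state-of-the-art test perplexity on the One-Billion Word dataset with 5x fewer parameters than the previous-best Transformer-based models.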

Contribution

The paper proposes a hierarchical attention method for Transformers whose run time and memory scale linearly with sequence length, and shows experimentally that its hierarchical inductive bias suits the sequences typical of natural language and vision tasks.

Findings

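01 Outperforms alternative sub-quadratic attention methods by more than 6 points on average on the Long Range Arena benchmark

02 Sets a new state-of-the-art test perplexity on the One-Billion Word dataset

03 Uses 5x fewer model parameters than the previous-best Transformer-based models on One-Billion Word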

Abstract

We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical of natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by more than 6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on the One-Billion Word dataset with 5x fewer model parameters than the previous-best Transformer-based models.
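To give a feel for why an H-Matrix-like structure can yield linear cost, here is a minimal counting sketch. It is not the paper's algorithm: it assumes a simplified partition in which the attention matrix is split into diagonal blocks computed exactly, while each off-diagonal block at every level of a binary hierarchy is summarized by a fixed `block` x `block` coarse grid. The function names and the `block` parameter are hypothetical, chosen only for illustration.

```python
def full_attention_entries(seq_len):
    """Entries in a dense attention matrix: quadratic in sequence length."""
    return seq_len * seq_len


def hierarchical_attention_entries(seq_len, block):
    """Entries stored under an assumed H-Matrix-like partition.

    Assumption (illustrative, not taken from the paper): diagonal blocks of
    size block x block are kept exactly, and every off-diagonal sibling block
    at each level of a binary hierarchy is summarized by block x block values.
    """
    n_blocks, rem = divmod(seq_len, block)
    if rem or n_blocks < 1 or n_blocks & (n_blocks - 1):
        raise ValueError("seq_len must be block times a power of two")

    # Finest level: one exact diagonal block per segment.
    total = n_blocks * block * block

    # Each level has n segments grouped into n/2 sibling pairs; each pair
    # contributes two off-diagonal blocks, i.e. n coarse blocks per level.
    n = n_blocks
    while n >= 2:
        total += n * block * block
        n //= 2
    return total


if __name__ == "__main__":
    for seq_len in (1024, 2048, 4096):
        dense = full_attention_entries(seq_len)
        hier = hierarchical_attention_entries(seq_len, block=16)
        print(seq_len, dense, hier)
```

Under this assumed partition the count works out to 3 * seq_len * block - 2 * block^2, so doubling the sequence length roughly doubles the stored entries, whereas the dense count quadruples.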

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Dense Connections · Byte Pair Encoding · Softmax