H-Transformer-1D: Fast One-Dimensional Hierarchical Attention for Sequences
Zhenhai Zhu, Radu Soricut

TL;DR
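H-Transformer-1D introduces a hierarchical attention mechanism inspired by Hierarchical Matrices (H-Matrices) from numerical analysis. It has linear run time and memory complexity, and it reports strong results on sequence tasks from natural language and vision while using fewer parameters than previous-best Transformer-based models.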
Contribution
The paper proposes a hierarchical attention method for Transformers that reduces the run time and memory cost of attention from quadratic to linear in sequence length, and that outperforms alternative sub-quadratic attention methods.
Findings
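Outperforms alternative sub-quadratic attention methods by over 6 points on average on the Long Range Arena benchmark
Sets a new state-of-the-art test perplexity on the One-Billion Word dataset
Uses 5x fewer model parameters than the previous-best Transformer-based models on One-Billion Word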
Abstract
We describe an efficient hierarchical method to compute attention in the Transformer architecture. The proposed attention mechanism exploits a matrix structure similar to the Hierarchical Matrix (H-Matrix) developed by the numerical analysis community, and has linear run time and memory complexity. We perform extensive experiments to show that the inductive bias embodied by our hierarchical attention is effective in capturing the hierarchical structure in the sequences typical for natural language and vision tasks. Our method is superior to alternative sub-quadratic proposals by over +6 points on average on the Long Range Arena benchmark. It also sets a new SOTA test perplexity on One-Billion Word dataset with 5x fewer model parameters than that of the previous-best Transformer-based models.
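The low-rank idea behind the hierarchical attention described in the abstract can be illustrated with a toy sketch. The NumPy code below is not the authors' algorithm: it is a simplified two-level approximation, and the function names `full_attention` and `two_level_attention`, the `block` parameter, and the choice of mean-pooling keys and values are all assumptions made for illustration. Each block of queries attends exactly to its own block and the adjacent blocks, and sees every farther block only through that block's mean key and mean value.

```python
import numpy as np


def full_attention(q, k, v):
    """Standard softmax attention; cost is quadratic in sequence length."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    return (w / w.sum(axis=-1, keepdims=True)) @ v


def two_level_attention(q, k, v, block=4):
    """Toy two-level approximation (hypothetical helper, not the paper's method).

    Near field: exact attention to the query's own block and adjacent blocks.
    Far field: each remaining block is summarised by its mean key and mean
    value, weighted by the number of tokens it stands for.
    """
    L, d = q.shape
    assert L % block == 0, "sequence length must be a multiple of block"
    nb = L // block
    k_coarse = k.reshape(nb, block, d).mean(axis=1)              # (nb, d)
    v_coarse = v.reshape(nb, block, v.shape[-1]).mean(axis=1)    # (nb, dv)
    scale = 1.0 / np.sqrt(d)
    out = np.empty((L, v.shape[-1]))

    for i in range(nb):
        rows = slice(i * block, (i + 1) * block)
        lo, hi = max(i - 1, 0), min(i + 2, nb)       # near-field block range
        near = slice(lo * block, hi * block)
        far = np.array([j for j in range(nb) if j < lo or j >= hi], dtype=int)

        s_near = q[rows] @ k[near].T * scale         # exact scores
        s_far = q[rows] @ k_coarse[far].T * scale    # one score per far block

        # Shared max so near and far weights use the same normalisation.
        m = s_near.max(axis=-1, keepdims=True)
        if far.size:
            m = np.maximum(m, s_far.max(axis=-1, keepdims=True))

        w_near = np.exp(s_near - m)
        w_far = block * np.exp(s_far - m)            # coarse key stands for `block` tokens
        denom = w_near.sum(axis=-1, keepdims=True) + w_far.sum(axis=-1, keepdims=True)
        out[rows] = (w_near @ v[near] + w_far @ v_coarse[far]) / denom
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(32, 8))
    k = rng.normal(size=(32, 8))
    v = rng.normal(size=(32, 8))
    err = np.abs(two_level_attention(q, k, v) - full_attention(q, k, v)).max()
    print("max abs deviation from full attention:", err)
```

Because it uses only two levels, this sketch costs roughly O(L·block + L²/block) rather than linear time; it is meant only to show how distant blocks can be summarised coarsely. The far-field approximation is exact when all keys inside each far block are identical, and it degrades as keys vary within a block.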
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Label Smoothing · Residual Connection · Layer Normalization · Dense Connections · Byte Pair Encoding · Softmax
