A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies
Hongyu Hè, Marko Kabic

TL;DR
This paper reviews and unifies existing long-sequence modeling methods, demonstrates the importance of long context, and proposes a scalable multi-head attention system capable of handling million-scale dependencies efficiently.
Contribution
It provides a unified mathematical framework for long-sequence models, benchmarks their performance, and introduces a scalable attention algorithm for million-scale dependencies.
Findings
Longer context lengths improve performance on many tasks.
Traditional Transformers struggle with long-range dependencies.
The proposed distributed multi-head attention scales efficiently on GPUs.
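One reason multi-head attention lends itself to distribution is that each head is computed independently of the others. The NumPy sketch below illustrates that property only; it is not the paper's algorithm. The function names, the `num_workers` parameter, and the choice to shard along the head axis are all assumptions made for illustration, and the "workers" are simulated with a plain loop rather than real GPUs.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(q, k, v):
    """Attention for inputs of shape (heads, n, d_head), all heads at once."""
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, n, n)
    return softmax(scores) @ v                           # (heads, n, d_head)

def sharded_multi_head_attention(q, k, v, num_workers):
    """Hypothetical head-wise sharding: each simulated worker gets a group of heads."""
    head_groups = np.array_split(np.arange(q.shape[0]), num_workers)
    parts = [
        multi_head_attention(q[g], k[g], v[g])  # one worker's share of heads
        for g in head_groups
        if len(g) > 0
    ]
    return np.concatenate(parts, axis=0)

rng = np.random.default_rng(0)
heads, n, d_head = 8, 16, 4
q, k, v = (rng.standard_normal((heads, n, d_head)) for _ in range(3))

full = multi_head_attention(q, k, v)
sharded = sharded_multi_head_attention(q, k, v, num_workers=3)
same = np.allclose(full, sharded)
```

Because no head reads another head's scores, the sharded result matches the single-pass result, and no communication is needed until the head outputs are gathered.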
Abstract
Ever since their conception, Transformers have taken over traditional sequence models in many tasks, such as NLP, image classification, and video/audio processing, for their fast training and superior performance. Much of the merit is attributable to positional encoding and multi-head attention. However, Transformers fall short in learning long-range dependencies mainly due to the quadratic complexity scaled with context length, in terms of both time and space. Consequently, over the past five years, a myriad of methods has been proposed to make Transformers more efficient. In this work, we first take a step back, study and compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation. Specifically, we summarize them using a unified template, given their shared nature of token mixing. Through benchmarks, we then demonstrate that long context length…
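The quadratic cost mentioned in the abstract comes from the attention score matrix, which holds one entry per pair of tokens. The sketch below is a minimal single-head scaled dot-product attention in NumPy, written to make the (n, n) intermediate visible; the helper `score_matrix_bytes` and the float32 assumption are illustrative and not taken from the paper.

```python
import numpy as np

def attention(q, k, v):
    """Single-head scaled dot-product attention for inputs of shape (n, d)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                 # (n, n): one score per token pair
    scores -= scores.max(axis=-1, keepdims=True)  # stabilize the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

def score_matrix_bytes(n, bytes_per_element=4):
    """Memory for one dense (n, n) score matrix, assuming float32 by default."""
    return n * n * bytes_per_element

rng = np.random.default_rng(0)
n, d = 8, 4
q, k, v = (rng.standard_normal((n, d)) for _ in range(3))
out, weights = attention(q, k, v)
```

Under these assumptions, a dense score matrix for a context of one million tokens would need `score_matrix_bytes(1_000_000)` = 4 × 10^12 bytes, about 4 TB per head, which is why naive attention does not reach million-scale context.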
Taxonomy
Topics: Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning · Machine Learning and Data Classification
Methods: Multi-Head Attention · Attention Is All You Need · Absolute Position Encodings · Dropout · Layer Normalization · Dense Connections · Position-Wise Feed-Forward Layer · Adam · Softmax · Label Smoothing
