Transformer Quality in Linear Time
Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc V. Le

TL;DR
This paper introduces FLASH, a new Transformer variant that maintains high quality with linear time complexity, enabling faster training on long sequences without significant performance loss.
Contribution
The paper presents a novel gated attention unit and a linear approximation method, significantly improving Transformer efficiency for long sequence processing.
Findings
FLASH matches the perplexity of improved Transformers on various datasets.
Achieves up to 12.1× training speedup on Wiki-40B.
Maintains high quality with linear complexity for long sequences.
Abstract
We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9 on Wiki-40B and 12.1 on PG-19 for auto-regressive language modeling, and 4.8 on C4 for masked language modeling.
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsTopic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
