Transformer Quality in Linear Time

Weizhe Hua; Zihang Dai; Hanxiao Liu; Quoc V. Le

arXiv:2202.10447 · cs.LG · June 28, 2022 · 45 cites

TL;DR

This paper introduces FLASH, a new Transformer variant that maintains high quality with linear time complexity, enabling faster training on long sequences without significant performance loss.

Contribution

The paper presents a novel gated attention unit and a linear approximation method, significantly improving Transformer efficiency for long sequence processing.

Findings

01

FLASH matches the perplexity of improved Transformers at both short (512) and long (8K) context lengths.

02

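Achieves training speedups of up to 4.9× on Wiki-40B and 12.1× on PG-19 for auto-regressive language modeling, and 4.8× on C4 for masked language modeling.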

03

Maintains high quality with linear complexity for long sequences.
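
To make the linear-complexity claim concrete, here is a minimal sketch of the generic reordering trick behind many linear attention methods. It is not FLASH's actual approximation, which the abstract does not detail. It assumes attention scores with no softmax, so the product can be regrouped; the function names are hypothetical.

```python
import numpy as np

def quadratic_attention(q, k, v):
    # Materializes the full (n, n) score matrix, so cost grows as n^2.
    return (q @ k.T) @ v

def linear_attention(q, k, v):
    # Same product, regrouped: k.T @ v is a small (d, d_v) summary,
    # so cost grows linearly with the sequence length n.
    return q @ (k.T @ v)

# Illustration only: random inputs, non-causal, no softmax or normalization.
rng = np.random.default_rng(0)
n, d = 64, 8
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
same = np.allclose(quadratic_attention(q, k, v), linear_attention(q, k, v))
print(same)
```

Both functions compute the same result by associativity of matrix multiplication; only the order of operations, and therefore the dependence on sequence length, differs. Causal (auto-regressive) use needs extra care, such as running prefix sums, which this sketch leaves out.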

Abstract

We revisit the design choices in Transformers, and propose methods to address their weaknesses in handling long sequences. First, we propose a simple layer named gated attention unit, which allows the use of a weaker single-head attention with minimal quality loss. We then propose a linear approximation method complementary to this new layer, which is accelerator-friendly and highly competitive in quality. The resulting model, named FLASH, matches the perplexity of improved Transformers over both short (512) and long (8K) context lengths, achieving training speedups of up to 4.9× on Wiki-40B and 12.1× on PG-19 for auto-regressive language modeling, and 4.8× on C4 for masked language modeling.
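
The abstract does not spell out the gated attention unit, so the following is only a rough sketch of how such a layer might look. It assumes a structure commonly described for this layer: two gated projections, a small shared projection from which queries and keys are derived by per-dimension scale and offset, and single-head attention with squared-ReLU scores. All names, sizes, and initializations are hypothetical, and normalization, residual connections, and position bias are omitted.

```python
import numpy as np

def silu(x):
    # SiLU / Swish activation (assumed choice for this sketch).
    return x / (1.0 + np.exp(-x))

def init_gau_params(d_model, expansion=2, s=16, seed=0):
    # Hypothetical parameter layout; shapes chosen for illustration only.
    rng = np.random.default_rng(seed)
    e = expansion * d_model
    return {
        "W_u": rng.normal(0.0, 0.02, (d_model, e)),  # gate projection
        "W_v": rng.normal(0.0, 0.02, (d_model, e)),  # value projection
        "W_z": rng.normal(0.0, 0.02, (d_model, s)),  # shared q/k projection
        "gamma": rng.normal(1.0, 0.02, (2, s)),      # per-dim scales for q, k
        "beta": np.zeros((2, s)),                    # per-dim offsets for q, k
        "W_o": rng.normal(0.0, 0.02, (e, d_model)),  # output projection
    }

def gated_attention_unit(x, p):
    # x: (n, d_model). Returns an array of the same shape.
    n = x.shape[0]
    u = silu(x @ p["W_u"])                    # gate, (n, e)
    v = silu(x @ p["W_v"])                    # values, (n, e)
    z = silu(x @ p["W_z"])                    # shared representation, (n, s)
    q = z * p["gamma"][0] + p["beta"][0]      # cheap query transform
    k = z * p["gamma"][1] + p["beta"][1]      # cheap key transform
    a = np.maximum(q @ k.T / n, 0.0) ** 2     # single-head squared-ReLU scores
    return (u * (a @ v)) @ p["W_o"]           # gate the attended values

x = np.random.default_rng(1).normal(size=(6, 8))
params = init_gau_params(d_model=8)
y = gated_attention_unit(x, params)
print(y.shape)  # (6, 8)
```

The elementwise gate `u` is what lets a single, weaker attention head suffice in this sketch: the attention output is modulated per position and per channel before the output projection. The score matrix here is still quadratic in sequence length; the linear approximation mentioned in the abstract is a separate step not shown.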

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications