Gated Linear Attention Transformers with Hardware-Efficient Training
Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, Yoon Kim

TL;DR
This paper introduces a hardware-efficient algorithm for linear attention and a gated linear attention (GLA) Transformer. The algorithm's implementation is faster than FLASHATTENTION-2 as a standalone layer, and the GLA Transformer shows strong length generalization in language modeling.
Contribution
It presents FLASHLINEARATTENTION, an I/O-aware implementation of linear attention, and the GLA Transformer, which adds data-dependent gates to linear attention and performs well in both length generalization and training throughput.
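To make "data-dependent gates" concrete, here is a minimal NumPy sketch of a gated linear attention recurrence. The text above does not spell out the gate's form, so the sketch assumes one plausible parameterization: a per-key-dimension decay in (0, 1) obtained by applying a sigmoid to a linear projection of the input. The names `gated_linear_attention_recurrent`, `W_q`, `W_k`, `W_v`, and `W_a` are hypothetical and chosen only for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_linear_attention_recurrent(Q, K, V, A):
    """Recurrent form with a matrix-valued state S of shape (d_k, d_v).

    A[t] is an assumed gate in (0, 1)^{d_k}; it decays each row of S
    before the new outer product k_t v_t^T is added.
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = A[t][:, None] * S + np.outer(K[t], V[t])  # decay, then write
        out[t] = Q[t] @ S                             # read with the query
    return out

# Toy inputs; all projection matrices are hypothetical stand-ins.
rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 8, 16, 4, 6
X = rng.standard_normal((T, d_model))
W_q, W_k, W_a = (rng.standard_normal((d_model, d_k)) for _ in range(3))
W_v = rng.standard_normal((d_model, d_v))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
A = sigmoid(X @ W_a)  # the gate depends on the input at each step
Y = gated_linear_attention_recurrent(Q, K, V, A)
```

Setting every gate value to 1 recovers ordinary (ungated) causal linear attention, which is a convenient sanity check for a sketch like this.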
Findings
FLASHLINEARATTENTION outperforms FLASHATTENTION-2 even on short sequences.
The GLA Transformer matches the performance of LLaMA and recent linear-time models.
The GLA Transformer, trained on 2K-token sequences, generalizes to sequences longer than 20K tokens.
Abstract
Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard…
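The dual view described in the abstract (parallel training versus an RNN with a matrix-valued hidden state) can be illustrated with a small NumPy sketch. It is not the paper's implementation: it omits normalization and feature maps, and the chunkwise variant is only an assumed illustration of how work can be split between a sequential state update across chunks and parallel computation within a chunk. All function names are hypothetical.

```python
import numpy as np

def linear_attention_parallel(Q, K, V):
    """Parallel form: causal-masked (Q K^T) V with no softmax."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T)))
    return ((Q @ K.T) * mask) @ V

def linear_attention_recurrent(Q, K, V):
    """RNN form: a (d_k, d_v) matrix state updated once per token."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

def linear_attention_chunkwise(Q, K, V, chunk=4):
    """Chunkwise form: sequential across chunks, parallel inside each."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))
    out = np.empty((T, d_v))
    for start in range(0, T, chunk):
        q = Q[start:start + chunk]
        k = K[start:start + chunk]
        v = V[start:start + chunk]
        c = q.shape[0]
        inter = q @ S                                    # earlier chunks
        intra = ((q @ k.T) * np.tril(np.ones((c, c)))) @ v  # this chunk
        out[start:start + chunk] = inter + intra
        S = S + k.T @ v                                  # carry state forward
    return out

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 3))
K = rng.standard_normal((10, 3))
V = rng.standard_normal((10, 5))

y_par = linear_attention_parallel(Q, K, V)
y_rec = linear_attention_recurrent(Q, K, V)
y_chk = linear_attention_chunkwise(Q, K, V, chunk=4)
```

All three functions compute the same output. The recurrent form does constant work per token at inference time, while the chunk size in the chunkwise form controls how much of the computation is done as dense matrix products versus sequential state updates.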
Taxonomy
Topics: Neural Networks and Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
