Gated Linear Attention Transformers with Hardware-Efficient Training

Songlin Yang; Bailin Wang; Yikang Shen; Rameswar Panda; Yoon Kim

arXiv:2312.06635 · cs.LG · August 28, 2024 · 5 cites

TL;DR

This paper introduces a hardware-efficient algorithm for linear attention and a gated linear attention (GLA) Transformer that improve on existing methods in training speed and length generalization for language modeling.

Contribution

The paper presents FLASHLINEARATTENTION, a faster implementation of linear attention, and the GLA Transformer, which adds data-dependent gates to linear attention and performs well on length generalization and training throughput.
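To make the idea of data-dependent gates concrete, here is a minimal NumPy sketch of a gated linear attention recurrence. It assumes a common simplified form in which a gate vector with entries in (0, 1), computed from the current input, decays each row of the matrix-valued state before the new key-value outer product is added. The function name, the `W_gate` projection, and the plain sigmoid gate are illustrative assumptions, not the paper's exact parameterization, and the sketch makes no attempt at hardware efficiency.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_linear_attention_recurrent(Q, K, V, X, W_gate):
    """Recurrent sketch of linear attention with a data-dependent gate.

    Q, K: (T, d_k); V: (T, d_v); X: (T, d_model) layer inputs.
    W_gate: (d_model, d_k) hypothetical gate projection (an assumption).
    """
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))        # matrix-valued hidden state
    out = np.zeros((T, d_v))
    for t in range(T):
        # Gate depends on the current input, hence "data-dependent".
        alpha = sigmoid(X[t] @ W_gate)              # shape (d_k,), values in (0, 1)
        # Decay each row of the state, then write the new key-value pair.
        S = alpha[:, None] * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S                           # read out with the query
    return out


# Small usage example with random inputs.
rng = np.random.default_rng(0)
T, d_model, d_k, d_v = 8, 16, 4, 6
X = rng.normal(size=(T, d_model))
Q = rng.normal(size=(T, d_k))
K = rng.normal(size=(T, d_k))
V = rng.normal(size=(T, d_v))
W_gate = rng.normal(size=(d_model, d_k)) * 0.1
out = gated_linear_attention_recurrent(Q, K, V, X, W_gate)
```

With the gate fixed at 1 this reduces to ordinary (ungated) linear attention, so the gate can be read as a learned, per-dimension forgetting rate on the state.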

Findings

01. FLASHLINEARATTENTION is faster than FLASHATTENTION-2 as a standalone layer, even at short sequence lengths (e.g., 1K).

02. The GLA Transformer matches the performance of LLaMA and recent linear models.

03. The GLA Transformer generalizes well to sequences longer than 20K tokens.

Abstract

Transformers with linear attention allow for efficient parallel training but can simultaneously be formulated as an RNN with 2D (matrix-valued) hidden states, thus enjoying linear-time inference complexity. However, linear attention generally underperforms ordinary softmax attention. Moreover, current implementations of linear attention lack I/O-awareness and are thus slower than highly optimized implementations of softmax attention. This work describes a hardware-efficient algorithm for linear attention that trades off memory movement against parallelizability. The resulting implementation, dubbed FLASHLINEARATTENTION, is faster than FLASHATTENTION-2 (Dao, 2023) as a standalone layer even on short sequence lengths (e.g., 1K). We then generalize this algorithm to a more expressive variant of linear attention with data-dependent gates. When used as a replacement for the standard…
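The equivalence the abstract relies on can be checked numerically. The NumPy sketch below computes causal linear attention three ways: a parallel form that materializes a masked T×T score matrix, a recurrent form that carries a matrix-valued state, and a chunkwise form that mixes the two. It assumes unnormalized linear attention with an identity feature map, and the function names and `chunk` parameter are illustrative. The chunkwise variant is meant only to illustrate the general trade-off between parallelism and state movement; it is not the FLASHLINEARATTENTION kernel itself.

```python
import numpy as np


def linear_attention_parallel(Q, K, V):
    """Parallel form: (Q K^T masked causally) V, quadratic in sequence length."""
    T = Q.shape[0]
    mask = np.tril(np.ones((T, T)))      # causal mask, includes the diagonal
    return ((Q @ K.T) * mask) @ V


def linear_attention_recurrent(Q, K, V):
    """Recurrent form: S_t = S_{t-1} + k_t v_t^T, o_t = q_t S_t."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))             # 2D (matrix-valued) hidden state
    out = np.zeros((T, d_v))
    for t in range(T):
        S = S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out


def linear_attention_chunkwise(Q, K, V, chunk=4):
    """Chunkwise form: recurrent across chunks, parallel within each chunk."""
    T, d_k = K.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))             # state summarizing all previous chunks
    out = np.zeros((T, d_v))
    for start in range(0, T, chunk):
        end = min(start + chunk, T)
        q, k, v = Q[start:end], K[start:end], V[start:end]
        mask = np.tril(np.ones((end - start, end - start)))
        inter = q @ S                    # contribution of earlier chunks
        intra = ((q @ k.T) * mask) @ v   # causal attention inside the chunk
        out[start:end] = inter + intra
        S = S + k.T @ v                  # fold this chunk into the state
    return out


# Small usage example: all three forms agree up to floating-point error.
rng = np.random.default_rng(0)
Q = rng.normal(size=(10, 4))
K = rng.normal(size=(10, 4))
V = rng.normal(size=(10, 6))
o_par = linear_attention_parallel(Q, K, V)
o_rec = linear_attention_recurrent(Q, K, V)
o_chk = linear_attention_chunkwise(Q, K, V, chunk=4)
```

A chunk size of 1 recovers the recurrent form and a chunk size of T recovers the parallel form, which is one way to see how a chunked algorithm can trade sequential state updates against larger parallel matrix multiplications.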

Code & Models

Datasets

huaXiaKyrie/up (dataset, 19k downloads)

Taxonomy

Topics: Neural Networks and Applications · Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Layer Normalization · Dropout · Dense Connections · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing