Various Lengths, Constant Speed: Efficient Language Modeling with Lightning Attention

Zhen Qin; Weigao Sun; Dong Li; Xuyang Shen; Weixuan Sun; Yiran Zhong

arXiv:2405.17381·cs.CL·June 21, 2024

TL;DR

Lightning Attention is a linear attention implementation that keeps training speed constant across sequence lengths under fixed memory consumption. It splits the attention calculation into intra-block and inter-block parts and tiles both the forward and backward passes to make full use of the GPU.

Contribution

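The paper presents Lightning Attention, a linear attention implementation that removes the cumulative summation (cumsum) bottleneck of earlier causal linear attention and trains at a constant speed across sequence lengths. It also introduces TransNormerLLM (TNL), an architecture tailored to Lightning Attention.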

Findings

01

Lightning Attention maintains constant training speed for different sequence lengths.

02

TNL is more efficient than the other language models it is compared against and performs on par with state-of-the-art models built on conventional transformer structures.

03

The source code is publicly available for reproducibility.

Abstract

We present Lightning Attention, the first linear attention implementation that maintains a constant training speed for various sequence lengths under fixed memory consumption. Due to the issue with cumulative summation operations (cumsum), previous linear attention implementations cannot achieve their theoretical advantage in a causal setting. However, this issue can be effectively solved by utilizing different attention calculation strategies to compute the different parts of attention. Specifically, we split the attention calculation into intra-blocks and inter-blocks and use conventional attention computation for intra-blocks and linear attention kernel tricks for inter-blocks. This eliminates the need for cumsum in the linear attention calculation. Furthermore, a tiling technique is adopted through both forward and backward procedures to take full advantage of the GPU hardware. To…
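The intra-block / inter-block split described in the abstract can be illustrated with a small sketch. This is not the authors' kernel: it assumes a simplified causal linear attention with no decay factor, no normalization, and no GPU tiling, and it uses plain Python lists instead of tensors. The function names (`reference_attention`, `blockwise_attention`) and the `block_size` parameter are hypothetical and chosen only for this illustration. The sketch shows that conventional masked attention inside each block, plus a running `K^T V` state carried across blocks, gives the same output as the quadratic masked form, without a per-token cumulative sum.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def causal_mask(scores):
    """Zero the entries above the diagonal so position t only sees s <= t."""
    return [[s if j <= i else 0.0 for j, s in enumerate(row)]
            for i, row in enumerate(scores)]

def reference_attention(q, k, v):
    """Quadratic baseline: (Q K^T with causal mask) V over the full sequence."""
    return matmul(causal_mask(matmul(q, transpose(k))), v)

def blockwise_attention(q, k, v, block_size):
    """Simplified intra-/inter-block split (assumption: no decay, no normalization)."""
    d_k, d_v = len(k[0]), len(v[0])
    kv = [[0.0] * d_v for _ in range(d_k)]  # running sum of K^T V over earlier blocks
    out = []
    for start in range(0, len(q), block_size):
        qb = q[start:start + block_size]
        kb = k[start:start + block_size]
        vb = v[start:start + block_size]
        # Intra-block: conventional masked attention restricted to this block.
        intra = matmul(causal_mask(matmul(qb, transpose(kb))), vb)
        # Inter-block: linear attention kernel trick, Q_i times the accumulated K^T V.
        inter = matmul(qb, kv)
        out.extend(add(intra, inter))
        # Fold this block into the state once the block is finished.
        kv = add(kv, matmul(transpose(kb), vb))
    return out

if __name__ == "__main__":
    random.seed(0)
    n, d = 12, 4
    q, k, v = ([[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
               for _ in range(3))
    ref = reference_attention(q, k, v)
    blk = blockwise_attention(q, k, v, block_size=4)
    err = max(abs(a - b) for ra, rb in zip(ref, blk) for a, b in zip(ra, rb))
    print("max abs difference:", err)
```

In this sketch the state `kv` has a fixed size of `d_k × d_v`, so the work per block does not grow with the number of preceding tokens; only the intra-block part is quadratic, and only in `block_size`.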

Code & Models

Repositories

opennlplab/transnormerllm (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings