Parallelizing Linear Transformers with the Delta Rule over Sequence Length
Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, Yoon Kim

TL;DR
This paper introduces a hardware-efficient training algorithm for linear transformers with the delta rule (DeltaNet). The algorithm makes it practical to train DeltaNet at language-model scale, where it outperforms existing linear-time baselines; hybrid models that include DeltaNet layers improve over standard transformers.
Contribution
The paper presents a training algorithm for DeltaNet, a more expressive linear transformer variant, that parallelizes over sequence length and therefore trains efficiently on modern hardware at large-language-model scale.
Findings
A 1.3B-parameter DeltaNet was trained on 100B tokens.
DeltaNet outperforms recent linear baselines like Mamba and GLA.
Hybrid models with DeltaNet layers improve over standard transformers.
Abstract
Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to…
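To make the contrast in the abstract concrete, here is a minimal sketch of the two state updates, written in plain Python. The additive rule is taken as S ← S + v kᵀ and the delta rule as S ← S − β (S k − v) kᵀ, which can be rewritten as S (I − β k kᵀ) + β v kᵀ; the factor (I − β k kᵀ) is where the Householder-style matrices come from. The function names, the scalar `beta`, and the toy vectors are illustrative assumptions, and this sequential loop is not the paper's parallel algorithm.

```python
# Illustrative sketch only: sequential updates on a tiny 2x2 state.
# Names (matvec, additive_step, delta_rule_step, beta) are assumptions
# made for this example, not identifiers from the paper's code.

def matvec(S, x):
    """Return S @ x for a list-of-lists matrix S and a list vector x."""
    return [sum(s_ij * x_j for s_ij, x_j in zip(row, x)) for row in S]

def additive_step(S, k, v):
    """Vanilla linear-attention update: S <- S + v k^T."""
    return [[s + v_i * k_j for s, k_j in zip(row, k)]
            for row, v_i in zip(S, v)]

def delta_rule_step(S, k, v, beta):
    """Delta-rule update: S <- S - beta * (S k - v) k^T.

    Algebraically the same as S (I - beta k k^T) + beta v k^T.
    """
    prediction = matvec(S, k)                       # value currently stored under key k
    error = [p - v_i for p, v_i in zip(prediction, v)]
    return [[s - beta * e_i * k_j for s, k_j in zip(row, k)]
            for row, e_i in zip(S, error)]

zeros = [[0.0, 0.0], [0.0, 0.0]]
k = [1.0, 0.0]                                      # the same unit-norm key is written twice
v_old, v_new = [1.0, 2.0], [3.0, 4.0]

S_add = additive_step(additive_step(zeros, k, v_old), k, v_new)
S_delta = delta_rule_step(delta_rule_step(zeros, k, v_old, 1.0), k, v_new, 1.0)

read_add = matvec(S_add, k)      # both values accumulate under the key
read_delta = matvec(S_delta, k)  # the old value is replaced by the new one
print(read_add, read_delta)
```

With a unit-norm key and `beta = 1.0`, the additive state returns the sum of both written values when queried with the key, while the delta-rule state returns only the most recent one. This overwrite behaviour is one way to read the abstract's claim that the delta rule is more effective at associative recall.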
Taxonomy
Topics: Cellular Automata and Applications · Quantum Computing Algorithms and Architecture · Computability, Logic, AI Algorithms
Methods: Softmax · Focus
