Parallelizing Linear Transformers with the Delta Rule over Sequence Length

Songlin Yang; Bailin Wang; Yu Zhang; Yikang Shen; Yoon Kim

arXiv:2406.06484·cs.LG·January 16, 2025·3 cites

TL;DR

This paper introduces a hardware-efficient training algorithm for linear transformers with the delta rule. The resulting DeltaNet models outperform recent linear-time baselines such as Mamba and GLA, and hybrid models that include DeltaNet layers improve over standard transformers.

Contribution

The paper presents a training algorithm for DeltaNet, a more expressive linear transformer variant, that parallelizes over sequence length and therefore trains efficiently on modern hardware at large-language-model scale.

Findings

01. Trained a 1.3B parameter DeltaNet on 100B tokens.
02. DeltaNet outperforms recent linear baselines like Mamba and GLA.
03. Hybrid models with DeltaNet layers improve over standard transformers.

Abstract

Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as a viable linear-time alternative to transformers with softmax attention. However, these models still underperform transformers especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to…
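
To make the contrast in the abstract concrete, the following is a minimal NumPy sketch of the two memory updates in their sequential, one-token-at-a-time form. It is not the paper's parallel training algorithm. The function names, the toy data, and the fixed scalar `beta` are assumptions made for illustration. The sketch writes two different values under the same key, then reads the memory back with that key.

```python
import numpy as np


def additive_update(S, k, v):
    # Vanilla linear attention write: S_t = S_{t-1} + v_t k_t^T
    return S + np.outer(v, k)


def delta_update(S, k, v, beta):
    # Delta rule write: S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T
    v_old = S @ k  # value currently stored under key k
    return S - beta * np.outer(v_old - v, k)


def delta_update_householder_form(S, k, v, beta):
    # Algebraically the same update, written as
    # S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    d_k = k.shape[0]
    return S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)


def run(update, keys, values, **kwargs):
    # Apply the writes sequentially, starting from an empty memory.
    S = np.zeros((values.shape[1], keys.shape[1]))
    for k, v in zip(keys, values):
        S = update(S, k, v, **kwargs)
    return S


# Toy data (assumption): the same unit-norm key is written twice.
keys = np.array([[1.0, 0.0], [1.0, 0.0]])
values = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

S_add = run(additive_update, keys, values)
S_delta = run(delta_update, keys, values, beta=1.0)

print(S_add @ keys[0])    # [5. 7. 9.]  both values are summed
print(S_delta @ keys[0])  # [4. 5. 6.]  the old value is overwritten
```

With a unit-norm key and `beta = 1.0`, the delta rule replaces the stored value, whereas the additive update accumulates both. The factor `I - beta * k k^T` in the second form has the structure of a (generalized) Householder matrix, and unrolling the recurrence multiplies one such factor per token, which is presumably where the products of Householder matrices mentioned in the abstract arise.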

Code & Models

Models

zhixuan-lin/delta_net-760m-longcrawl64-48b
model · 66 dl

Datasets

huaXiaKyrie/up
dataset · 19k dl

Taxonomy

Topics: Cellular Automata and Applications · Quantum Computing Algorithms and Architecture · Computability, Logic, AI Algorithms

Methods: Softmax · Focus