RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale

Daniel Goldstein; Eric Alcaide; Janna Lu; Eugene Cheah

arXiv:2505.03005 · cs.CL · January 23, 2026

Open Access · 1 Repo · 10 Models · 1 Dataset

TL;DR

RADLADS is a rapid, low-cost protocol for converting softmax attention transformers into linear attention models. Conversion needs only 350-700M tokens, quality at inference stays close to the original transformer, and the converted models are released publicly.

Contribution

The paper presents a protocol for rapidly converting softmax attention transformers into linear attention decoder models, two new RWKV-variant architectures, and publicly released models converted from Qwen2.5 in 7B, 32B, and 72B sizes.

Findings

01

The converted models achieve state-of-the-art downstream performance among linear attention models of their size.

02

Conversion requires only 350-700M tokens, less than 0.005% of the tokens used to train the original teacher models.

03

Converting to the 72B linear attention model costs less than $2,000 USD.

Abstract

We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the…
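The "less than 0.005%" figure in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below assumes the Qwen2.5 teachers were pretrained on roughly 18 trillion tokens; that number comes from the Qwen2.5 release, not from this abstract, and the function name is purely illustrative.

```python
# Assumption (not stated in the abstract): the Qwen2.5 teacher models were
# pretrained on roughly 18 trillion tokens.
TEACHER_PRETRAIN_TOKENS = 18e12


def fraction_percent(conversion_tokens, teacher_tokens=TEACHER_PRETRAIN_TOKENS):
    """Return conversion tokens as a percentage of teacher pretraining tokens."""
    return 100.0 * conversion_tokens / teacher_tokens


# The abstract gives a conversion budget of 350M to 700M tokens.
low = fraction_percent(350e6)
high = fraction_percent(700e6)
print(f"{low:.4f}% to {high:.4f}%")  # prints "0.0019% to 0.0039%"
```

Under that assumption, both ends of the 350-700M range fall below the 0.005% bound quoted in the abstract.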

Code & Models

Repositories

recursal/radlads-paper
PyTorch · Official

Datasets

recursal/DCLM-10B-Qwen2-binidx
dataset · 106 downloads

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Big Data and Digital Economy

Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax