RADLADS: Rapid Attention Distillation to Linear Attention Decoders at Scale
Daniel Goldstein, Eric Alcaide, Janna Lu, Eugene Cheah

TL;DR
RADLADS introduces a rapid, cost-effective method to convert softmax attention transformers into linear attention models, achieving high performance with minimal training data and resources, and releasing models for broad use.
Contribution
The paper presents a new protocol for fast conversion of transformers to linear attention models, along with novel RWKV-variant architectures and publicly available models.
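To make the softmax-to-linear distinction concrete, here is a minimal sketch of generic causal linear attention in pure Python. It is not the RWKV-variant architecture introduced in the paper: it omits decay, gating, and normalization, and the function names are hypothetical. It only illustrates why a linear attention decoder can carry a fixed-size state instead of a key-value cache that grows with sequence length.

```python
def linear_attention_recurrent(qs, ks, vs):
    """Causal, unnormalized linear attention using a fixed-size state.

    qs, ks: lists of length-d_k vectors; vs: list of length-d_v vectors.
    Illustrative only; not the paper's architecture.
    """
    d_k, d_v = len(ks[0]), len(vs[0])
    # The state is a d_k x d_v matrix whose size does not depend on sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(qs, ks, vs):
        # State update: S <- S + k v^T
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        # Readout: o = q^T S
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


def linear_attention_quadratic(qs, ks, vs):
    """Same computation written as attention over all previous positions."""
    outputs = []
    for t, q in enumerate(qs):
        out = [0.0] * len(vs[0])
        for s in range(t + 1):  # causal: only positions s <= t contribute
            score = sum(a * b for a, b in zip(q, ks[s]))  # no softmax applied
            for j, v_j in enumerate(vs[s]):
                out[j] += score * v_j
        outputs.append(out)
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
    ks = [[0.2, 0.3], [1.0, -1.0], [0.0, 0.5]]
    vs = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0], [2.0, 2.0, 2.0]]
    print(linear_attention_recurrent(qs, ks, vs))
    print(linear_attention_quadratic(qs, ks, vs))
```

Both functions compute the same outputs. The recurrent form does constant work per token, which is the property that motivates converting a softmax attention transformer into a linear attention decoder.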
Findings
Models achieve state-of-the-art downstream performance among linear attention models of their size.
Conversion requires only 350-700M tokens, less than 0.005% of the tokens used to train the original teacher models.
Converting to the 72B model costs under $2,000 USD.
Abstract
We present Rapid Attention Distillation to Linear Attention Decoders at Scale (RADLADS), a protocol for rapidly converting softmax attention transformers into linear attention decoder models, along with two new RWKV-variant architectures, and models converted from popular Qwen2.5 open source models in 7B, 32B, and 72B sizes. Our conversion process requires only 350-700M tokens, less than 0.005% of the token count used to train the original teacher models. Converting to our 72B linear attention model costs less than $2,000 USD at today's prices, yet quality at inference remains close to the original transformer. These models achieve state-of-the-art downstream performance across a set of standard benchmarks for linear attention models of their size. We release all our models on HuggingFace under the Apache 2.0 license, with the exception of our 72B models which are also governed by the…
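As a rough sanity check on the token-fraction claim, the arithmetic can be sketched as follows. The teacher token count here is an assumption not stated in the text above: it uses the roughly 18 trillion pretraining tokens publicly reported for Qwen2.5. The function and constant names are hypothetical.

```python
# Assumption: Qwen2.5 teacher models were pretrained on about 18 trillion tokens.
# This figure does not appear in the abstract above.
TEACHER_TOKENS = 18e12


def fraction_percent(conversion_tokens, teacher_tokens=TEACHER_TOKENS):
    """Return conversion tokens as a percentage of teacher training tokens."""
    return 100.0 * conversion_tokens / teacher_tokens


if __name__ == "__main__":
    # The abstract gives a range of 350M to 700M conversion tokens.
    for tokens in (350e6, 700e6):
        pct = fraction_percent(tokens)
        print(f"{tokens:.0f} tokens is {pct:.4f}% of the assumed teacher budget")
```

Under that assumption, both ends of the 350-700M range fall below the 0.005% threshold quoted in the abstract.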
Code & Models
- 🤗 featherless-ai/QRWKV-72B (model · 126 dl · ♡ 67)
- 🤗 featherless-ai/QRWKV-QwQ-32B (model · 22 dl · ♡ 30)
- 🤗 recursal/radlads-7b-various (model)
- 🤗 recursal/QRWKV6-7B-Instruct (model · 11 dl · ♡ 1)
- 🤗 recursal/QRWKV6-7B-Base (model · 16 dl · ♡ 2)
- 🤗 recursal/QRWKV7-7B-Instruct (model · 9 dl · ♡ 7)
- 🤗 OpenMOSE/HRWKV7-Reka-Flash3-Preview (model · ♡ 1)
- 🤗 OpenMOSE/HRWKV7-Reka-Flash3.1-Preview (model · ♡ 1)
- 🤗 OpenMOSE/HRWKV7-hxa079-Qwen3-8B (model)
- 🤗 OpenMOSE/RWKV-Seed-OSS-36B-hxa07A (model · 4 dl · ♡ 2)
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Big Data and Digital Economy
Methods: Attention Is All You Need · Sparse Evolutionary Training · Softmax
