Linearizing Large Language Models
Jean Mercat, Igor Vasiljevic, Sedrick Keh, Kushal Arora, Achal Dave, Adrien Gaidon, Thomas Kollar

TL;DR
This paper introduces SUPRA, a cost-effective method to convert large pre-trained transformers into linear RNNs, enabling efficient training with minimal compute while maintaining competitive performance on benchmarks.
Contribution
The paper proposes SUPRA, a novel approach to linearize large pre-trained transformers into RNNs, reducing training costs by 95% and leveraging existing models' strengths.
Findings
Linearized models perform competitively on standard benchmarks.
Linear models still show persistent shortfalls in in-context learning and long-context modeling.
SUPRA enables efficient linearization of large transformers.
Abstract
Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural…
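To make the "fixed-size recurrent state" mentioned in the abstract concrete, here is a minimal NumPy sketch of generic causal linear attention, computed once in the quadratic attention-matrix form and once as a recurrence. It is not SUPRA's implementation: the `elu(x) + 1` feature map, the sum-based normalization, and the function names are illustrative assumptions chosen for the example.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map, elu(x) + 1. An illustrative choice,
    # not necessarily the kernel used by SUPRA.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(q, k, v):
    """Causal linear attention via an explicit (T, T) score matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = np.tril(phi_q @ phi_k.T)          # causal mask: keep j <= t
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(q, k, v):
    """Same computation as an RNN with a fixed-size state."""
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))               # running sum of phi(k_j) v_j^T
    norm = np.zeros(d_k)                       # running sum of phi(k_j)
    out = np.empty((T, d_v))
    for t in range(T):
        phi_q, phi_k = feature_map(q[t]), feature_map(k[t])
        state += np.outer(phi_k, v[t])
        norm += phi_k
        out[t] = (phi_q @ state) / (phi_q @ norm)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 3))
same = np.allclose(linear_attention_parallel(q, k, v),
                   linear_attention_recurrent(q, k, v))
```

In the recurrent form, the memory carried between steps is `state` plus `norm`, whose size depends only on the head dimensions and not on the sequence length, which is why inference cost per token stays constant.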
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Softmax
