Linearizing Large Language Models

Jean Mercat; Igor Vasiljevic; Sedrick Keh; Kushal Arora; Achal Dave; Adrien Gaidon; Thomas Kollar

arXiv:2405.06640 · cs.CL · May 13, 2024



TL;DR

This paper introduces SUPRA, a cost-effective method to convert large pre-trained transformers into linear RNNs, enabling efficient training with minimal compute while maintaining competitive performance on benchmarks.

Contribution

The paper proposes SUPRA, an approach that uptrains existing large pre-trained transformers into linear RNNs instead of pre-training linear models from scratch, reducing training cost by 95% while reusing what the existing models have already learned.

Findings

01. Linearized models perform competitively on standard benchmarks.

02. Even the largest linear models show persistent shortfalls in in-context learning and long-context modeling.

03. SUPRA enables efficient linearization of large transformers.

Abstract

Linear transformers have emerged as a subquadratic-time alternative to softmax attention and have garnered significant interest due to their fixed-size recurrent state that lowers inference cost. However, their original formulation suffers from poor scaling and underperforms compute-matched transformers. Recent linear models such as RWKV and Mamba have attempted to address these shortcomings by proposing novel time-mixing and gating architectures, but pre-training large language models requires significant data and compute investments. Thus, the search for subquadratic architectures is limited by the availability of compute and quality pre-training datasets. As a cost-effective alternative to pre-training linear transformers, we propose Scalable UPtraining for Recurrent Attention (SUPRA). We present a method to uptrain existing large pre-trained transformers into Recurrent Neural…
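
The "fixed-size recurrent state" mentioned in the abstract can be illustrated with a minimal NumPy sketch of generic kernelized linear attention. This is not SUPRA's specific architecture: the feature map `feature_map` (elu(x) + 1), the function names, and the normalization by a running sum are assumptions chosen for illustration. The sketch shows that a causal linear-attention layer computed over the whole sequence gives the same output as an RNN whose state size does not depend on sequence length.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map, elu(x) + 1 (illustrative choice, not from the paper).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_parallel(q, k, v):
    """Reference form: causal linear attention over the full sequence."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    scores = np.tril(phi_q @ phi_k.T)  # (T, T); lower triangle enforces causality
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention_recurrent(q, k, v):
    """Same computation as an RNN with a fixed-size state."""
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))  # running sum of outer(phi(k_s), v_s)
    norm = np.zeros(d_k)          # running sum of phi(k_s)
    out = np.empty((T, d_v))
    for t in range(T):
        phi_q, phi_k = feature_map(q[t]), feature_map(k[t])
        state += np.outer(phi_k, v[t])
        norm += phi_k
        out[t] = (phi_q @ state) / (phi_q @ norm)
    return out

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((6, 4)) for _ in range(3))
assert np.allclose(linear_attention_parallel(q, k, v),
                   linear_attention_recurrent(q, k, v))
```

In the recurrent form, each new token updates `state` and `norm` in place, so per-token inference cost and memory stay constant as the context grows, whereas softmax attention must keep and re-read all past keys and values.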


Code & Models

Repositories

tri-ml/linear_open_lm (PyTorch, official)


Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Softmax