Scaling Laws for Linear Complexity Language Models
Xuyang Shen, Dong Li, Ruitao Leng, Zhen Qin, Weigao Sun, Yiran Zhong

TL;DR
This paper establishes scaling laws for linear complexity language models. Through extensive experiments across several architectures and benchmarks, it shows that these models scale similarly to transformers and perform strongly on linguistic tasks.
Contribution
The paper provides the first comprehensive analysis of how linear complexity language models scale, comparing three linear architectures against LLaMA as a softmax-attention baseline.
Findings
Linear models exhibit similar scaling behavior to transformers.
Linear models outperform the transformer baseline on linguistic tasks and knowledge retention.
Scaling laws are established for three efficient linear architectures.
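To make "establishing a scaling law" concrete, here is a minimal sketch of fitting a power law of the form y = a * x^b by least squares in log-log space. This is the generic form commonly used in scaling-law studies; whether it matches the paper's exact parameterization is an assumption. The function name `fit_power_law` and the data points are hypothetical and synthetic, not results from the paper.

```python
import numpy as np

def fit_power_law(x, y):
    """Fit y = a * x**b by linear least squares on (log x, log y).

    Returns (a, b). Assumes all x and y values are strictly positive.
    """
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return float(np.exp(intercept)), float(slope)

# Synthetic, made-up points for illustration only (NOT data from the paper):
# a loss that decays as a power of training compute.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = 50.0 * compute ** -0.07

a, b = fit_power_law(compute, loss)
print(f"fitted: loss = {a:.3f} * compute^({b:.4f})")
```

Comparing the fitted exponent `b` across architectures is one simple way to check whether two model families "scale similarly".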
Abstract
The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures. These include TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline architecture for softmax attention for comparison. These models were trained with six variants, ranging from 70M to 7B parameters on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks. These tasks include validation loss, commonsense reasoning, and information retrieval and…
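The three decay regimes named in the abstract can be illustrated with a deliberately simplified recurrent form of linear attention. This is a sketch under stated assumptions, not the actual TNL, HGRN2, or cosFormer2 architecture: it uses a single head, a scalar decay per step, and no feature maps, gating, or normalization. The function name `linear_attention_recurrent` is hypothetical.

```python
import numpy as np

def linear_attention_recurrent(q, k, v, decay):
    """Causal linear attention computed as a recurrence over time.

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    decay: array of shape (T,), one scalar per step in (0, 1].
      - all ones               -> no decay
      - one constant below 1   -> data-independent decay
      - values computed from the input at each step -> data-dependent decay
    The state has fixed size (d_k, d_v), so total cost grows linearly in T.
    """
    T, d_k = q.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))
    out = np.zeros((T, d_v))
    for t in range(T):
        # Shrink the running summary, then add the current key-value outer product.
        state = decay[t] * state + np.outer(k[t], v[t])
        out[t] = q[t] @ state
    return out

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))

out_no_decay = linear_attention_recurrent(q, k, v, np.ones(T))
out_fixed_decay = linear_attention_recurrent(q, k, v, np.full(T, 0.9))
```

With no decay, the recurrence gives the same result as causally masked attention without softmax, `np.tril(q @ k.T) @ v`, but it never materializes the T-by-T score matrix.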
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · LLaMA
