Scaling Laws for Linear Complexity Language Models

Xuyang Shen; Dong Li; Ruitao Leng; Zhen Qin; Weigao Sun; Yiran Zhong

arXiv:2406.16690·cs.CL·June 25, 2024

TL;DR

This paper establishes scaling laws for linear complexity language models, showing that they scale similarly to transformers and perform strongly on linguistic tasks, based on experiments across three linear architectures and a LLaMA baseline.

Contribution

The paper provides the first comprehensive analysis of the scalability of linear complexity language models, comparing three linear architectures (TNL, HGRN2, and cosFormer2) against a LLaMA baseline.

Findings

01 Linear models exhibit scaling behavior similar to transformers.

02 Linear models outperform the transformer baseline on linguistic tasks and knowledge retention.

03 Scaling laws are established for three efficient linear architectures.
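
As a rough illustration of what establishing a scaling law involves, the sketch below fits a generic power law, loss ≈ a · compute^b, by least squares in log-log space. The functional form, the function name fit_power_law, and the data points are assumptions made for illustration; the points are synthetic and are not results from the paper, whose exact fitting procedure is not described on this page.

```python
import math


def fit_power_law(xs, ys):
    """Fit ys ≈ a * xs**b by ordinary least squares on (log x, log y).

    Hypothetical helper for illustration; returns the pair (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent b.
    cov = sum((lx - mean_x) * (ly - mean_y) for lx, ly in zip(log_x, log_y))
    var = sum((lx - mean_x) ** 2 for lx in log_x)
    b = cov / var
    # Intercept in log space is log(a).
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic points generated from a known law (NOT measurements from the paper).
compute = [1e18, 1e19, 1e20, 1e21]
loss = [50.0 * c ** -0.07 for c in compute]

a, b = fit_power_law(compute, loss)
```

Because the synthetic points lie exactly on a power law, the fit recovers the generating coefficient and exponent up to floating-point error. Real training curves are noisy, so a fit of this kind would normally be reported with its residuals.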

Abstract

The interest in linear complexity models for large language models is on the rise, although their scaling capacity remains uncertain. In this study, we present the scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behaviors of three efficient linear architectures. These include TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a baseline architecture for softmax attention for comparison. These models were trained with six variants, ranging from 70M to 7B parameters on a 300B-token corpus, and evaluated with a total of 1,376 intermediate checkpoints on various downstream tasks. These tasks include validation loss, commonsense reasoning, and information retrieval and…
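
The three decay regimes named in the abstract (no decay, data-independent decay, data-dependent decay) can be sketched with one simplified recurrence, S_t = λ_t · S_{t-1} + k_tᵀ v_t with output o_t = q_t S_t. This is a minimal sketch under stated assumptions: the decay is a single scalar per step, normalization and feature maps are omitted, and the function name linear_attention_recurrent is hypothetical. It is not the actual TNL, HGRN2, or cosFormer2 implementation.

```python
def linear_attention_recurrent(q, k, v, decay):
    """Simplified causal linear attention in recurrent form.

    q, k: lists of T vectors of length d_k; v: list of T vectors of length d_v;
    decay: list of T scalars in [0, 1], one per time step (an assumption made
    for illustration; real architectures may use per-dimension gates).
    """
    d_k, d_v = len(k[0]), len(v[0])
    # The state is a d_k x d_v matrix, so memory does not grow with T.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, lam in zip(q, k, v, decay):
        # Decay the old state, then add the outer product k_t^T v_t.
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam * state[i][j] + k_t[i] * v_t[j]
        # Read out o_t = q_t S_t.
        outputs.append(
            [sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0], [2.0], [3.0]]

no_decay = linear_attention_recurrent(q, k, v, [1.0, 1.0, 1.0])     # decay fixed at 1
fixed_decay = linear_attention_recurrent(q, k, v, [0.5, 0.5, 0.5])  # same constant every step
gated_decay = linear_attention_recurrent(q, k, v, [1.0, 0.0, 1.0])  # values that would come from the input
```

Each step costs O(d_k · d_v) regardless of sequence length, which is where the linear complexity in the title comes from. With the decay fixed at 1 the recurrence reduces to unnormalized causal linear attention, and a decay of 0 at a step discards everything before that step.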

Code & Models

Repositories

opennlplab/scalinglaws (Official)

Videos

Scaling Laws for Linear Complexity Language Models · underline

Taxonomy

Topics: Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · LLaMA