Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Jiacheng Liu; Sewon Min; Luke Zettlemoyer; Yejin Choi; Hannaneh Hajishirzi

arXiv:2401.17377·cs.CL·April 8, 2025·1 cites

Open Access 2 Repos

TL;DR

This paper introduces infini-gram, an engine for unbounded n-gram ($\infty$-gram) language models built on 5 trillion tokens that supports arbitrarily large n, and demonstrates its value in text analysis and in improving neural LLMs.

Contribution

The paper presents a new $\infty$-gram language model with backoff and an efficient suffix-array-based engine that makes n-gram modeling practical at unprecedented data scale and n-gram length.

Findings

01. Infini-gram achieves 47% accuracy in next-token prediction.

02. It significantly reduces perplexity when combined with neural LLMs.

03. Analysis reveals irregularities in machine-generated text related to suffix length.

Abstract

Are $n$-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing $n$-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest $n$-gram LM ever built. Second, existing $n$-gram LMs use small $n$ which hinders their performance; we instead allow $n$ to be arbitrarily large, by introducing a new $\infty$-gram LM with backoff. Instead of pre-computing $n$-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute $\infty$-gram (as well as $n$-gram with arbitrary $n$) probabilities with millisecond-level latency. The $\infty$-gram framework and infini-gram engine enable us to conduct many…
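To make the idea in the abstract concrete, here is a minimal toy sketch of $\infty$-gram estimation with backoff over a suffix array. It is not the infini-gram engine: the function names (`build_suffix_array`, `count`, `infgram_prob`) are hypothetical, the suffix array is built naively in memory, and the longest matching suffix is found by a simple linear scan. It assumes the backoff rule is "use the longest suffix of the context that has a nonzero count in the corpus", then estimates the next-token probability as a ratio of counts.

```python
def build_suffix_array(tokens):
    # Naive construction: sort every suffix start position by the suffix itself.
    # Fine for a toy corpus; far too slow and memory-hungry for real data.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def _bound(tokens, sa, pattern, upper):
    # Binary search over the suffix array, comparing each suffix's first
    # len(pattern) tokens with the pattern. Returns the lower bound of the
    # matching range, or the upper bound when upper=True.
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = tokens[sa[mid]:sa[mid] + m]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def count(tokens, sa, pattern):
    # Number of occurrences of `pattern` (a list of tokens) in the corpus.
    return _bound(tokens, sa, pattern, True) - _bound(tokens, sa, pattern, False)


def infgram_prob(tokens, sa, context, next_token):
    # Back off from the full context to shorter suffixes until one occurs
    # in the corpus, then return (probability, length of the suffix used).
    for start in range(len(context) + 1):
        suffix = context[start:]
        denom = count(tokens, sa, suffix)
        if denom > 0:
            numer = count(tokens, sa, suffix + [next_token])
            return numer / denom, len(suffix)
    raise ValueError("empty corpus")


corpus = "a b c a b d a b c".split()
sa = build_suffix_array(corpus)
# "x a b" never occurs, so the estimate backs off to the suffix "a b",
# which occurs 3 times and is followed by "c" twice.
prob, used = infgram_prob(corpus, sa, ["x", "a", "b"], "c")
print(prob, used)  # 0.666..., 2
```

Because counts are looked up on demand by binary search, no table of $n$-gram counts has to be precomputed for any fixed $n$, which is the property the abstract highlights.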

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling