Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens
Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi

TL;DR
This paper introduces Infini-gram, a scalable unbounded n-gram language model trained on 5 trillion tokens, capable of arbitrarily large n, and demonstrates its effectiveness in text analysis and enhancing neural LLMs.
Contribution
The paper presents a novel ∞-gram language model with backoff and an efficient suffix-array-based engine, infini-gram, that enables training and inference at unprecedented data scale and with arbitrarily large n.
Findings
Infini-gram achieves 47% accuracy in next-token prediction.
It significantly reduces perplexity when combined with neural LLMs.
Analysis reveals irregularities in machine-generated text related to suffix length.
Abstract
Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we showcase their values in both text analysis and improving neural LLMs. This was done by modernizing n-gram LMs in two aspects. First, we train them at the same data scale as neural LLMs -- 5 trillion tokens. This is the largest n-gram LM ever built. Second, existing n-gram LMs use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many…
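The abstract's two ideas, counting n-grams through a suffix array instead of a precomputed table and backing off to the longest context suffix that actually occurs in the corpus, can be illustrated on a toy corpus. The following is a minimal sketch under stated assumptions, not the infini-gram engine itself: the function names (`build_suffix_array`, `find_range`, `infgram_prob`), the naive suffix-array construction, and the linear scan over suffix lengths are all hypothetical choices made for readability.

```python
from typing import List, Sequence, Tuple


def build_suffix_array(tokens: Sequence[str]) -> List[int]:
    """Return the start positions of all suffixes in lexicographic order.

    Naive construction for a toy corpus only; a real engine would use a
    far more efficient algorithm and on-disk storage (assumption).
    """
    return sorted(range(len(tokens)), key=lambda i: list(tokens[i:]))


def find_range(tokens: Sequence[str], sa: List[int],
               pattern: Sequence[str]) -> Tuple[int, int]:
    """Return the half-open range [lo, hi) of suffix-array slots whose
    suffixes start with `pattern`; hi - lo is the occurrence count."""
    pat = list(pattern)
    m = len(pat)

    def prefix(slot: int) -> List[str]:
        start = sa[slot]
        return list(tokens[start:start + m])

    # Lower bound: first slot whose m-token prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pat:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Upper bound: first slot whose m-token prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pat:
            lo = mid + 1
        else:
            hi = mid
    return first, lo


def infgram_prob(tokens: Sequence[str], sa: List[int],
                 context: Sequence[str], next_token: str) -> Tuple[float, int]:
    """Back off to the longest suffix of `context` with a nonzero count,
    then return (count(suffix + next_token) / count(suffix), len(suffix))."""
    for start in range(len(context) + 1):  # longest suffix first
        suffix = list(context[start:])
        lo, hi = find_range(tokens, sa, suffix)
        denom = hi - lo
        if denom > 0:
            lo2, hi2 = find_range(tokens, sa, suffix + [next_token])
            return (hi2 - lo2) / denom, len(suffix)
    return 0.0, 0  # only reached when the corpus is empty


# Toy corpus (illustrative data, not from the paper).
corpus = "the cat sat on the mat the cat ran".split()
sa = build_suffix_array(corpus)
prob, used = infgram_prob(corpus, sa, ["saw", "the", "cat"], "sat")
```

Here "saw the cat" never occurs in the toy corpus, so the sketch backs off to "the cat", which occurs twice and is followed by "sat" once. Each count costs two binary searches over the suffix array, which is why no n-gram table has to be built ahead of time.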
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling
