Scaling Retrieval-Based Language Models with a Trillion-Token Datastore
Rulin Shao, Jacqueline He, Akari Asai, Weijia Shi, Tim Dettmers, Sewon Min, Luke Zettlemoyer, Pang Wei Koh

TL;DR
This paper shows that increasing the size of the datastore used by a retrieval-based language model consistently improves language modeling and several downstream tasks. With a 1.4 trillion-token datastore, a smaller retrieval-augmented model outperforms a larger LM-only model on knowledge-intensive tasks, which makes datastore size a scaling dimension alongside model size and pretraining data.
Contribution
The authors introduce a 1.4 trillion-token datastore, described as the largest open-source datastore for retrieval-based LMs, and provide a comprehensive analysis of how datastore scaling affects language model performance.
Findings
Larger datastores monotonically improve language modeling and several downstream tasks, without obvious saturation.
A smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks.
For the same training compute budget, using a larger datastore significantly improves model performance.
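The first finding can be given some intuition with a toy sketch. This is not the paper's pipeline: the random unit vectors stand in for document and query embeddings, and every name here (`make_unit_vector`, `top1_similarity`, `scaling_curve`) is hypothetical. It shows only that when smaller datastores are nested subsets of larger ones, the best retrieval score for a query can never decrease as the datastore grows. Whether better retrieval translates into better downstream performance is the empirical question the paper studies.

```python
import random


def make_unit_vector(rng, dim):
    """Draw a random direction; a stand-in for a real text embedding."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    return [x / norm for x in v]


def top1_similarity(query, datastore):
    """Best cosine similarity between the query and any datastore entry.

    All vectors are unit-norm, so the dot product equals cosine similarity.
    """
    return max(sum(q * d for q, d in zip(query, doc)) for doc in datastore)


def scaling_curve(query, full_datastore, sizes):
    """Score nested prefixes of the datastore.

    Each smaller datastore is a subset of every larger one, so the
    maximum over the larger set is at least the maximum over the smaller.
    """
    return [(n, top1_similarity(query, full_datastore[:n])) for n in sizes]


rng = random.Random(0)  # fixed seed so the run is repeatable
DIM = 16                # assumed toy embedding dimension
full_datastore = [make_unit_vector(rng, DIM) for _ in range(2000)]
query = make_unit_vector(rng, DIM)

curve = scaling_curve(query, full_datastore, sizes=[10, 100, 1000, 2000])
for n, score in curve:
    print(f"datastore size {n:>5}: best similarity {score:.3f}")
```

The guarantee comes from the nesting alone. If each size were sampled independently instead, the curve could fluctuate from one size to the next.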
Abstract
Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
