Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

Rulin Shao; Jacqueline He; Akari Asai; Weijia Shi; Tim Dettmers; Sewon Min; Luke Zettlemoyer; Pang Wei Koh

arXiv:2407.12854 · cs.CL · July 19, 2024

TL;DR

This paper shows that increasing the size of the datastore used by a retrieval-based language model monotonically improves language modeling and several downstream tasks. With a 1.4 trillion-token datastore, a smaller retrieval-augmented model outperforms a larger LM-only model on knowledge-intensive tasks, which makes datastore size a scaling dimension alongside parameters and training data.

Contribution

The authors construct a 1.4 trillion-token open-source datastore for retrieval-based LMs and analyze how datastore size, model size, and pretraining data size affect language model performance.

Findings

01 Larger datastores monotonically improve language modeling and several downstream tasks, without obvious saturation.

02 A smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks.

03 For the same training compute budget, using a larger datastore can significantly improve model performance.

Abstract

Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token…
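To make the idea of inference-time data concrete, here is a minimal sketch of retrieval-in-context: the most similar documents from a datastore are prepended to the prompt before it is passed to a language model. Everything in it is an illustrative assumption rather than the authors' pipeline: the bag-of-words cosine similarity stands in for a real retriever, and the function names (`retrieve`, `build_prompt`) and the toy datastores are hypothetical.

```python
import math
import re
from collections import Counter


def bow(text):
    """Lowercased bag-of-words counts; a toy stand-in for a learned retriever."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, datastore, k=2):
    """Return the k datastore documents most similar to the query."""
    q = bow(query)
    ranked = sorted(datastore, key=lambda doc: cosine(q, bow(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, datastore, k=2):
    """Prepend retrieved documents to the query (hypothetical prompt format)."""
    context = "\n".join(retrieve(query, datastore, k))
    return f"{context}\n\nQuestion: {query}\nAnswer:"


# Toy datastores: the larger one contains a document relevant to the query.
small_store = [
    "The Eiffel Tower is in Paris.",
    "Bananas are rich in potassium.",
]
large_store = small_store + ["Mount Everest is the highest mountain on Earth."]

query = "What is the highest mountain on Earth?"
top_small = retrieve(query, small_store, k=1)
top_large = retrieve(query, large_store, k=1)
prompt = build_prompt(query, large_store, k=1)
```

With the smaller datastore the best available match is unrelated to the question, while the larger datastore surfaces the relevant document. This toy example only illustrates the mechanism by which a larger datastore can help; it says nothing about the magnitude of the effect reported in the paper.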

Code & Models

Repositories

rulinshao/retrieval-scaling
Official

Datasets

rulins/MassiveDS-1.4T
dataset · 162 downloads

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques