TL;DR
This paper studies how to split a fixed data budget between pretraining data and a retrieval store in retrieval-augmented language models, and proposes a scaling framework to guide that allocation.
Contribution
It introduces a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval store size, and uses it to guide how data is divided between pretraining and retrieval.
Findings
Retrieval consistently improves performance across model scales.
A three-dimensional scaling manifold models performance as a function of model size, pretraining tokens, and retrieval store size.
Optimal data allocation depends on model scale, task, and pretraining saturation.
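The idea of a scaling manifold over model size, pretraining tokens, and retrieval store size can be illustrated with a minimal Python sketch. It assumes a Chinchilla-style additive power law extended with a retrieval term; this functional form, every coefficient value, and the names `predicted_loss` and `best_split` are hypothetical placeholders chosen for illustration, not the paper's fitted model.

```python
# Illustrative only: an ASSUMED additive power-law surface, not the paper's fitted form.
# All coefficients below are arbitrary placeholders, not reported results.

def predicted_loss(n_params, pretrain_tokens, store_tokens,
                   e=1.7, a=400.0, alpha=0.34,
                   b=410.0, beta=0.28,
                   c=5.0, gamma=0.15):
    """Hypothetical loss as a function of model size, pretraining tokens, and store size."""
    return (e
            + a / n_params ** alpha
            + b / pretrain_tokens ** beta
            + c / (1.0 + store_tokens) ** gamma)


def best_split(n_params, total_tokens, steps=99):
    """Grid-search the fraction of a fixed token budget given to pretraining.

    Returns (fraction_to_pretraining, predicted_loss) for the best grid point.
    """
    best = None
    for i in range(1, steps + 1):
        frac = i / (steps + 1)               # strictly between 0 and 1
        d = frac * total_tokens              # tokens used for pretraining
        r = (1.0 - frac) * total_tokens      # tokens placed in the retrieval store
        loss = predicted_loss(n_params, d, r)
        if best is None or loss < best[1]:
            best = (frac, loss)
    return best


if __name__ == "__main__":
    # Smallest and largest model sizes mentioned in the abstract; the 20x budget is arbitrary.
    for n in (30e6, 3e9):
        frac, loss = best_split(n, total_tokens=20 * n)
        print(f"N={n:.0e}: pretraining fraction={frac:.2f}, loss={loss:.3f}")
```

Swapping a fitted surface into `predicted_loss` would let the same search be repeated per task. The additive form used here cannot express interactions between model size and retrieval, which a real fit may need.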
Abstract
Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over…
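The data scales in the abstract are stated as multiples, which can be converted to token counts with a small Python helper. This sketch assumes that both the pretraining multiple and the retrieval store multiple are relative to the parameter count (the abstract states this explicitly only for pretraining); the name `token_budget` is hypothetical.

```python
def token_budget(n_params, pretrain_multiple, store_multiple):
    """Convert multiples of the parameter count into token counts.

    ASSUMPTION: the retrieval store multiple is also relative to n_params.
    """
    if not (1 <= pretrain_multiple <= 150):
        raise ValueError("pretraining multiple outside the 1-150x range in the abstract")
    if not (1 <= store_multiple <= 20):
        raise ValueError("store multiple outside the 1-20x range in the abstract")
    return {
        "pretrain_tokens": int(n_params * pretrain_multiple),
        "store_tokens": int(n_params * store_multiple),
    }


# A 30M-parameter model at 20x pretraining and a 10x retrieval store.
print(token_budget(30_000_000, 20, 10))
# → {'pretrain_tokens': 600000000, 'store_tokens': 300000000}
```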
