To Memorize or to Retrieve: Scaling Laws for RAG-Considerate Pretraining

Karan Singh; Michael Yu; Varun Gangal; Zhuofu Tao; Sachin Kumar; Emmy Liu; Steven Y. Feng

arXiv:2604.00715 · cs.CL · April 2, 2026

TL;DR

This paper investigates how to optimally balance pretraining data and retrieval store size in retrieval-augmented language models, providing a scaling framework that guides resource allocation for improved performance.

Contribution

The paper introduces a three-dimensional scaling framework that models performance as a function of model size, pretraining tokens, and retrieval store size, and uses it to guide how data is allocated between pretraining and the retrieval store.
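
To make the idea of a three-input scaling function concrete, here is a minimal sketch. It is not the paper's fitted form, which this summary does not give. It assumes a Chinchilla-style additive power law with one extra term for the retrieval store. The function names and every coefficient are hypothetical placeholders chosen only so the code runs.

```python
# Hypothetical illustration only: the functional form and all coefficients
# below are assumptions, not values reported by the paper.

def hypothetical_loss(n_params, pretrain_tokens, store_tokens,
                      e=1.5, a=400.0, alpha=0.34,
                      b=400.0, beta=0.28, c=5.0, gamma=0.10):
    """Additive power law in model size, pretraining tokens, and store size."""
    return (e
            + a / n_params ** alpha
            + b / pretrain_tokens ** beta
            + c / store_tokens ** gamma)


def best_split(n_params, total_tokens, steps=99):
    """Grid-search the fraction of a fixed token budget spent on pretraining.

    The remainder of the budget goes to the retrieval store.
    Returns (lowest loss found, pretraining fraction that achieves it).
    """
    candidates = []
    for i in range(1, steps + 1):
        frac = i / (steps + 1)
        pretrain = total_tokens * frac
        store = total_tokens - pretrain
        candidates.append((hypothetical_loss(n_params, pretrain, store), frac))
    return min(candidates)


if __name__ == "__main__":
    loss, frac = best_split(n_params=1e8, total_tokens=1e10)
    print(f"pretraining fraction {frac:.2f} gives loss {loss:.3f}")
```

With fitted rather than placeholder coefficients, the same grid search could be repeated per model size and per task to see how the preferred split moves.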

Findings

01 Retrieval consistently improves performance across model scales.

02 A scaling manifold models performance as a function of model size, pretraining data, and retrieval store size.

03 Optimal data allocation depends on model scale, task, and pretraining saturation.

Abstract

Retrieval-augmented generation (RAG) improves language model (LM) performance by providing relevant context at test time for knowledge-intensive situations. However, the relationship between parametric knowledge acquired during pretraining and non-parametric knowledge accessed via retrieval remains poorly understood, especially under fixed data budgets. In this work, we systematically study the trade-off between pretraining corpus size and retrieval store size across a wide range of model and data scales. We train OLMo-2-based LMs ranging from 30M to 3B parameters on up to 100B tokens of DCLM data, while varying both pretraining data scale (1-150x the number of parameters) and retrieval store size (1-20x), and evaluate performance across a diverse suite of benchmarks spanning reasoning, scientific QA, and open-domain QA. We find that retrieval consistently improves performance over…
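
The multipliers in the abstract can be turned into absolute token counts with simple arithmetic. The sketch below assumes that both the pretraining multiplier (1-150x) and the retrieval store multiplier (1-20x) are relative to the parameter count, and that the 100B-token limit caps pretraining data. The abstract states neither explicitly. `token_budget` is a hypothetical helper, not part of the paper's code.

```python
# Assumption: "up to 100B tokens" is read as a cap on pretraining tokens.
MAX_PRETRAIN_TOKENS = 100_000_000_000


def token_budget(n_params, pretrain_mult, store_mult):
    """Convert per-parameter multipliers into absolute token counts.

    Assumes both multipliers are relative to n_params (an interpretation
    of the abstract, not a confirmed detail of the experimental setup).
    """
    pretrain_tokens = min(int(n_params * pretrain_mult), MAX_PRETRAIN_TOKENS)
    store_tokens = int(n_params * store_mult)
    return {"pretrain_tokens": pretrain_tokens, "store_tokens": store_tokens}


if __name__ == "__main__":
    # Smallest and largest model sizes named in the abstract.
    for n_params in (30_000_000, 3_000_000_000):
        for pretrain_mult, store_mult in ((1, 1), (150, 20)):
            print(n_params, pretrain_mult, store_mult,
                  token_budget(n_params, pretrain_mult, store_mult))
```

Under these assumptions a 30M-parameter model at 150x sees 4.5B pretraining tokens, while a 3B-parameter model reaches the 100B cap well before 150x.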

Code & Models

Repositories

degenai-labs/RAG-scaling-laws (GitHub)
