Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs

Jacob Portes; Connor Jennings; Erica Ji Yuen; Sasha Doubov; Michael Carbin

arXiv:2508.17400·cs.LG·August 26, 2025

TL;DR

This paper investigates how the retrieval capabilities of large language models improve with increased pretraining FLOPs, showing predictable scaling and strong correlations with in-context learning performance.

Contribution

It benchmarks retrieval performance across LLMs from 125 million to 7 billion parameters, pretrained on 1 billion to more than 2 trillion tokens, and identifies scaling relationships with implications for LLM-based retrieval systems.

Findings

01 Retrieval performance on zero-shot BEIR tasks scales predictably with model size, training duration, and estimated FLOPs.

02 In-Context Learning scores correlate strongly with retrieval scores across retrieval tasks.

03 These scaling relationships inform the development of LLM-based retrievers.

Abstract

How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.
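To give a rough sense of the compute range the abstract spans, the sketch below estimates pretraining FLOPs with the widely used approximation FLOPs ≈ 6 · N · D, where N is the parameter count and D is the number of training tokens. This is an assumption for illustration: the abstract says only "estimated FLOPs" and does not state which estimator the authors use. The function name `estimate_pretraining_flops` is hypothetical, and pairing the smallest model with the smallest dataset (and the largest with the largest) is likewise an illustrative choice, not a configuration confirmed by the source.

```python
def estimate_pretraining_flops(num_params: float, num_tokens: float) -> float:
    """Approximate training compute as 6 * N * D.

    Assumption: the common rule of thumb of roughly 6 FLOPs per parameter
    per token (forward plus backward pass). The paper's own estimator
    may differ.
    """
    return 6.0 * num_params * num_tokens


# Endpoints taken from the abstract: 125M to 7B parameters, 1B to 2T+ tokens.
# The abstract says "more than 2 trillion", so 2e12 is a lower bound.
smallest = estimate_pretraining_flops(125e6, 1e9)
largest = estimate_pretraining_flops(7e9, 2e12)

print(f"smallest run: {smallest:.2e} FLOPs")
print(f"largest run:  {largest:.2e} FLOPs")
print(f"ratio:        {largest / smallest:.2e}")
```

Under these assumptions the two endpoints differ by roughly five orders of magnitude in compute, which is the kind of range over which a scaling trend can be fit.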
