Retrieval Capabilities of Large Language Models Scale with Pretraining FLOPs
Jacob Portes, Connor Jennings, Erica Ji Yuen, Sasha Doubov, Michael Carbin

TL;DR
This paper investigates how the retrieval capabilities of large language models improve with increased pretraining FLOPs, showing predictable scaling and strong correlations with in-context learning performance.
Contribution
It benchmarks retrieval performance across LLMs ranging from 125 million to 7 billion parameters, pretrained on 1 billion to more than 2 trillion tokens, and relates retrieval quality to model size, training duration, and estimated pretraining FLOPs.
Findings
Retrieval performance on zero-shot BEIR tasks scales predictably with model size, training duration, and estimated FLOPs.
In-Context Learning scores correlate strongly with retrieval scores across retrieval tasks.
These scaling trends have implications for the development of LLM-based retrievers.
Abstract
How does retrieval performance scale with pretraining FLOPs? We benchmark retrieval performance across LLM model sizes from 125 million parameters to 7 billion parameters pretrained on datasets ranging from 1 billion tokens to more than 2 trillion tokens. We find that retrieval performance on zero-shot BEIR tasks predictably scales with LLM size, training duration, and estimated FLOPs. We also show that In-Context Learning scores are strongly correlated with retrieval scores across retrieval tasks. Finally, we highlight the implications this has for the development of LLM-based retrievers.
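To give a sense of the compute range the abstract covers, the sketch below estimates pretraining FLOPs with the widely used approximation FLOPs ≈ 6 · N · D, where N is the parameter count and D is the number of training tokens. This rule of thumb is an assumption here: the abstract says only "estimated FLOPs" and does not state how the estimate is made. The function name and the two (N, D) pairings are illustrative corners of the stated ranges, not specific runs reported in the paper.

```python
def estimated_pretraining_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute as 6 * N * D.

    Assumption: the common rule of thumb of roughly 6 FLOPs per parameter
    per token (forward plus backward pass). Not confirmed as the paper's method.
    """
    return 6.0 * n_params * n_tokens


# Illustrative corners of the ranges in the abstract (hypothetical pairings).
smallest = estimated_pretraining_flops(125e6, 1e9)   # 125M params, 1B tokens
largest = estimated_pretraining_flops(7e9, 2e12)     # 7B params, 2T tokens

print(f"smallest: {smallest:.2e} FLOPs")
print(f"largest:  {largest:.2e} FLOPs")
print(f"ratio:    {largest / smallest:.2e}")
```

Under this approximation, the two corners differ by roughly five orders of magnitude in compute, which is the kind of span over which a scaling trend can be read off.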
