Inference Scaling for Long-Context Retrieval Augmented Generation
Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky

TL;DR
This paper explores how scaling inference computation in retrieval augmented generation (RAG) with long-context LLMs improves performance, introduces inference scaling laws, and predicts optimal compute allocation for enhanced knowledge utilization.
Contribution
It introduces inference scaling laws for RAG, models optimal test-time compute allocation, and demonstrates significant performance gains through scaled inference configurations.
Findings
When optimally allocated, scaling inference compute yields nearly linear gains in RAG performance.
The proposed model accurately predicts optimal inference parameters.
Scaling inference compute achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
Abstract
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring the combination of multiple strategies beyond simply increasing the quantity of knowledge, including in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs' ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance…
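The abstract names three knobs for scaling test-time computation: retrieved documents, in-context examples, and generation steps. As a minimal sketch of how such configurations could be enumerated under a token budget, the Python below uses a deliberately simple cost model. The names `RagConfig`, `estimated_tokens`, `configs_within_budget`, and `best_config`, the per-document and per-query token counts, and the cost formula itself are all illustrative assumptions, not the authors' implementation or their allocation model.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class RagConfig:
    """Hypothetical inference configuration (names are assumptions)."""
    num_docs: int        # retrieved documents per example
    num_shots: int       # in-context demonstrations
    num_iterations: int  # generation steps (1 = single-pass RAG)


def estimated_tokens(cfg: RagConfig,
                     tokens_per_doc: int = 150,
                     query_tokens: int = 50) -> int:
    """Rough input-token estimate under an assumed cost model.

    Assumption: every demonstration carries its own retrieved documents,
    and each iteration re-reads the full prompt.
    """
    per_example = cfg.num_docs * tokens_per_doc + query_tokens
    per_call = (cfg.num_shots + 1) * per_example  # shots plus the test query
    return cfg.num_iterations * per_call


def configs_within_budget(budget: int,
                          docs: Sequence[int],
                          shots: Sequence[int],
                          iterations: Sequence[int]) -> List[RagConfig]:
    """Enumerate every configuration whose estimated cost fits the budget."""
    return [
        cfg
        for cfg in (RagConfig(d, s, i) for d, s, i in product(docs, shots, iterations))
        if estimated_tokens(cfg) <= budget
    ]


def best_config(budget: int,
                score_fn: Callable[[RagConfig], float],
                docs: Sequence[int],
                shots: Sequence[int],
                iterations: Sequence[int]) -> Optional[RagConfig]:
    """Exhaustive search: return the highest-scoring config within budget.

    score_fn is a placeholder for a measured or predicted task metric.
    """
    candidates = configs_within_budget(budget, docs, shots, iterations)
    return max(candidates, key=score_fn, default=None)
```

In practice `score_fn` would have to come from validation runs or from a fitted predictor. The grid of candidates grows multiplicatively with each knob, which is one reason a predictive allocation model can be attractive compared with exhaustive search.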
Peer Reviews
Decision·ICLR 2025 Oral
The paper studies two interesting research questions: the scaling behavior of long-context RAG methods and the prediction of test-time computation allocation for them. The paper conducts systematic experiments on inference scaling of long-context RAG models and reveals the scaling properties of DRAG and IterDRAG, i.e., performance improves almost linearly under the optimal configuration. In addition, the computation allocation model generalizes well across domains and context lengths, which potenti
I have a question about the application of the computation allocation model. When pretraining LLMs, computation allocation models are crucial because pretraining is extremely resource-intensive. Inference, however, is typically much less costly by comparison. So why not determine the best configuration by simply searching for it?
Inference-time scaling laws for RAG systems -- extremely interesting, and the community really needs an analysis like this one.
It is not clear whether the current analysis may generalise to future SFT/RLHF regimens.
The research question is quite interesting, as there is not much work on inference time scaling for RAG; this study systematically explores this area and may draw some attention.
I am concerned that this work is more suitable as a technical report than as a research-oriented study. There is considerable related work combining long-context LLMs and RAG, and the main contribution of this work is the proposed RAG inference scaling law. However, this conclusion is method-specific and may not apply to other methods.
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning · Image Retrieval and Classification Techniques
Methods: Attention Is All You Need · WordPiece · Attention Dropout · Linear Layer · Weight Decay · Linear Warmup With Linear Decay · Dropout · Byte Pair Encoding · BERT
