Retrieval-Augmented Generation as Noisy In-Context Learning: A Unified Theory and Risk Bounds
Yang Guo, Yutian Tao, Yifei Ming, Robert D. Nowak, Yingyu Liang

TL;DR
This paper provides the first theoretical generalization bounds for retrieval-augmented generation (RAG), modeling it as noisy in-context learning and analyzing its bias-variance tradeoff, with empirical validation on QA benchmarks.
Contribution
It introduces a finite-sample generalization bound for RAG, unifies RAG and in-context learning under a common framework, and models retrieval noise from various sources.
Findings
RAG has an intrinsic ceiling on generalization error.
Sample efficiency of RAG and ICL is empirically validated.
Theoretical analysis reveals bias-variance tradeoff in RAG.
Abstract
Retrieval-augmented generation (RAG) has seen many empirical successes in recent years by aiding the LLM with external knowledge. However, its theoretical aspect has remained mostly unexplored. In this paper, we propose the first finite-sample generalization bound for RAG in in-context linear regression and derive an exact bias-variance tradeoff. Our framework views the retrieved texts as query-dependent noisy in-context examples and recovers the classical in-context learning (ICL) and standard RAG as the limit cases. Our analysis suggests that an intrinsic ceiling on generalization error exists on RAG as opposed to the ICL. Furthermore, our framework is able to model retrieval both from the training data and from external corpora by introducing uniform and non-uniform RAG noise. In line with our theory, we show the sample efficiency of ICL and RAG empirically with experiments on common…
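The data model in the abstract (retrieved texts as query-dependent noisy in-context examples) can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the paper's construction: the names `make_prompt` and `predict`, the Gaussian offset around the query, and all parameters are hypothetical, and a least-squares fit through the origin stands in for the linear self-attention model that the reviews say the paper analyzes.

```python
import random

def make_prompt(w, x_q, n_icl, m_rag, offset_std, icl_noise_std, rag_noise_std, rng):
    """Return (x, y) pairs: n_icl ordinary ICL examples, then m_rag retrieved ones."""
    examples = []
    for _ in range(n_icl):
        x = rng.gauss(0.0, 1.0)               # ICL input drawn independently of the query
        examples.append((x, w * x + rng.gauss(0.0, icl_noise_std)))
    for _ in range(m_rag):
        x = x_q + rng.gauss(0.0, offset_std)  # retrieved input sits near the query (assumed)
        examples.append((x, w * x + rng.gauss(0.0, rag_noise_std)))
    return examples

def predict(examples, x_q):
    """Least-squares slope through the origin, applied to the query (assumed proxy)."""
    sxx = sum(x * x for x, _ in examples)
    sxy = sum(x * y for x, y in examples)
    return (sxy / sxx) * x_q if sxx > 0 else 0.0

rng = random.Random(0)
prompt = make_prompt(w=2.0, x_q=1.5, n_icl=3, m_rag=4, offset_std=0.1,
                     icl_noise_std=0.0, rag_noise_std=0.0, rng=rng)
print(len(prompt), predict(prompt, 1.5))
```

Setting `m_rag=0` leaves only the ordinary in-context examples, which corresponds to the classical ICL limit mentioned in the abstract; raising `rag_noise_std` above `icl_noise_std` is one simple way to make the retrieved examples noisier than the clean ones.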
Peer Reviews
Decision · Submitted to ICLR 2026
1. It's great to see a paper that finally puts some solid theory behind RAG, which has been a mostly empirical field until now. 2. The idea of modeling RAG as noisy ICL is quite clever and provides a unified lens to understand its connection to standard in-context learning. 3. The experimental results do a good job of backing up the theoretical claims, especially in showing how performance can drop when you add too many retrieved examples.
1. The analysis is limited to a linear regression setting, which feels a bit disconnected from the complex, non-linear reality of modern language models. 2. The paper doesn't touch on how RAG fine-tuning might change the dynamics, which is a pretty common way people use RAG in practice. 3. While the noise models are a good start, they might be too simple to capture all the different ways retrieval can be "noisy" in the real world.
The most tangible contribution is a finite-sample, closed-form risk analysis that cleanly separates variance and bias effects of adding retrieved examples. Partition-depth-like roles are played by the number and distance of retrieved items, producing an optimal m* with diminishing returns O(1/m²). This provides decision-relevant guidance for when to stop retrieving and how to trade off close versus far items.
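A toy calculation shows how a finite optimal m* can arise. It is a sketch under assumptions that are not taken from the paper: the prediction is the plain average of the m closest retrieved labels, label noise has standard deviation `noise_std`, and the j-th closest item carries a bias of `j * bias_per_rank` (a hypothetical linear-in-rank stand-in for distance). The averaged noise contributes `noise_std**2 / m`, so the gain from one more item is `noise_std**2 / (m * (m + 1))`, which shrinks like 1/m², while the averaged bias grows with m.

```python
def toy_risk(m, noise_std, bias_per_rank):
    """Risk of averaging the m closest items when item j has bias j * bias_per_rank."""
    mean_bias = bias_per_rank * (m + 1) / 2   # average of bias_per_rank * (1, ..., m)
    return mean_bias ** 2 + noise_std ** 2 / m

def best_budget(max_m, noise_std, bias_per_rank):
    """Brute-force search for the retrieval budget with the lowest toy risk."""
    return min(range(1, max_m + 1),
               key=lambda m: toy_risk(m, noise_std, bias_per_rank))

def variance_gain(m, noise_std):
    """Variance removed by going from m to m + 1 items: sigma^2 / (m * (m + 1))."""
    return noise_std ** 2 / m - noise_std ** 2 / (m + 1)

print(best_budget(50, noise_std=1.0, bias_per_rank=0.1))
```

With `bias_per_rank=0.0` the toy risk decreases for every additional item, so `best_budget` returns `max_m`; any positive bias per rank pulls the optimum to a finite budget.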
1. The theory rests on stylized assumptions, e.g., Gaussian linear data, LSA proxy, no RAG finetuning, and power-law retrieval distance distributions. These assumptions enable tractability, and the conclusions should be read as qualitative guidance. 2. The empirical section is still preliminary in scope, and does not benchmark against strong retrieval policies or compression strategies, such as error-time-memory Pareto. Consequently, claims about optimal budgets and mixing ratios remain suggesti
* The paper offers a rigorous, unified view by casting RAG as noisy ICL and deriving finite-sample bounds plus an explicit bias-variance split. * The query-dependent retrieval offset and the two noise regimes (uniform and distance-dependent) align well with practical retrieval behavior. * The theory isolates why extra retrieval beyond a point stops helping by showing variance reduction without bias reduction under uniform noise.
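The last point, variance reduction without bias reduction, can be illustrated with a toy estimator. This is a hedged sketch, not the paper's bound: it assumes the prediction is the mean of m retrieved labels that all share one fixed bias and have independent Gaussian noise, with the true target set to 0. The function names and parameters are hypothetical. Under these assumptions the risk is `bias**2 + noise_std**2 / m`, so it falls as m grows but never drops below `bias**2`.

```python
import random

def analytic_risk(m, bias, noise_std):
    """Expected squared error of the mean of m labels sharing one bias."""
    return bias ** 2 + noise_std ** 2 / m

def simulated_risk(m, bias, noise_std, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check on the formula."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        mean_label = sum(bias + rng.gauss(0.0, noise_std) for _ in range(m)) / m
        total += mean_label ** 2   # the target is 0, so the squared error is mean_label**2
    return total / trials

for m in (1, 4, 16, 64):
    print(m, analytic_risk(m, bias=0.5, noise_std=1.0))
```

The printed values approach `0.25` (the squared bias) from above, which is the kind of floor that extra retrieval alone cannot remove in this toy setting.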
* The theoretical results rely on a single-layer linear self-attention model with Gaussian inputs and isotropic noises, which may limit transfer to modern nonlinear transformer stacks. * The evaluation metric in theory is MSE while the experiments report EM on QA, and the paper does not quantify how this metric mismatch affects conclusions. * The Gaussian retrieval-offset assumption simplifies real retrieval distributions and ignores indexing heuristics and filtering used in practice. The ex
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Multimodal Machine Learning Applications
Methods: Layer Normalization · Linear Warmup With Linear Decay · Attention Dropout · Byte Pair Encoding · Softmax · Linear Layer · Dropout · Dense Connections · Attention Is All You Need
