A Theory for Token-Level Harmonization in Retrieval-Augmented Generation
Shicheng Xu, Liang Pang, Huawei Shen, Xueqi Cheng

TL;DR
This paper introduces a theoretical framework for understanding and balancing the benefit and detriment of retrieval-augmented generation (RAG) in language models, enabling prediction and optimization without additional training.
Contribution
It provides the first formal theory modeling RAG as a fusion of distributions and predicts RAG effects without training, leading to a novel token-level harmonization method called Tok-RAG.
Findings
Tok-RAG effectively balances benefit and detriment in experiments.
Theoretical predictions align with empirical results.
The method improves generation quality in real-world tasks.
Abstract
Retrieval-augmented generation (RAG) utilizes retrieved texts to enhance large language models (LLMs). Studies show that while RAG provides valuable external information (benefit), it may also mislead LLMs (detriment) with noisy or incorrect retrieved texts. Although many existing methods attempt to preserve benefit and avoid detriment, they lack a theoretical explanation for RAG. The benefit and detriment in the next-token prediction of RAG remain a black box that cannot be quantified or compared in an explainable manner, so existing methods are data-driven, require additional utility evaluators, or operate post hoc. This paper takes the first step towards providing a theory to explain and trade off the benefit and detriment in RAG. First, we model RAG as the fusion between the distribution of LLMs' knowledge and the distribution of retrieved texts. Then, we formalize the trade-off between the value of…
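To make the idea of a token-level trade-off between two distributions concrete, here is a minimal toy sketch. It is not the paper's Tok-RAG criterion, which the truncated abstract does not specify. The function names (`kl_divergence`, `choose_token`), the use of KL divergence as the closeness measure, and the decision rule itself are assumptions chosen for illustration only. The sketch takes three next-token distributions over a shared vocabulary (pure LLM, RAG, and an estimate of the retrieved-text distribution) and, when the LLM and RAG disagree on the top token, keeps the RAG token only if the RAG distribution sits closer to the retrieved-text distribution than to the LLM's own.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    # eps guards against division by zero when q assigns zero mass to a token.
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax(dist):
    """Index of the highest-probability token."""
    return max(range(len(dist)), key=dist.__getitem__)


def choose_token(p_llm, p_rag, p_retrieved):
    """Hypothetical token-level selection rule (illustrative assumption).

    Returns (token_index, source), where source is "agree", "rag", or "llm".
    """
    llm_token = argmax(p_llm)
    rag_token = argmax(p_rag)
    if llm_token == rag_token:
        # No conflict between the two generators, so nothing to trade off.
        return rag_token, "agree"
    # Assumed proxy: trust retrieval when the RAG distribution is closer to
    # the retrieved-text distribution than to the LLM's prior distribution.
    to_retrieved = kl_divergence(p_rag, p_retrieved)
    to_llm = kl_divergence(p_rag, p_llm)
    if to_retrieved < to_llm:
        return rag_token, "rag"
    return llm_token, "llm"


if __name__ == "__main__":
    # Toy three-token vocabulary; all probabilities are made up.
    print(choose_token([0.7, 0.2, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1]))
    print(choose_token([0.5, 0.4, 0.1], [0.4, 0.5, 0.1], [0.0, 0.1, 0.9]))
```

In practice the retrieved-text distribution is not directly observable and would itself have to be estimated; the sketch simply takes it as an input.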
Peer Reviews
Decision·ICLR 2025 Poster
The paper has a fair bit of novel theoretical underpinnings for thinking about retrieved knowledge in terms of a latent concept variable. And if sound and practical, there is significance to the findings especially when it comes to merging two sources of information. The writing is overall clear (except for the items in weaknesses) and quality is fair as well.
The paper tries to do two things at once, provide both theoretical and practical justification. However, the theoretical justification is at best very convoluted and at worst incorrect. The practical justification is a good start, but the authors test on QA datasets with factoid answers potentially hiding the fact that their method may not generalize outside of this setting. It would be more convincing if the authors did one thing (either theoretical or practical) in a sound and convincing manne
1. This paper connects theory with practice, and the formula derivation is logically clear. Most of the paper is well written and explanatory, and is supported by experiments. 2. The standpoint of this paper is novel, using the perspective of latent variable models to explain RAG, and analyzing the distribution differences between LLM and external knowledge.
1. In Equation 2, it seems a bit forced to split it into two terms, even though we know that distribution fusion is not a simple addition of distributions. 2. From Equation 4 to 5, the symbol is used incorrectly; an equals sign cannot be used. 3. In Equation 7, the $P_r$ is not explained. Is it $p_R(r)$? 4. In Section 3.1, during the exploratory experiments on the distribution of retrieved texts $p_R(x_i|x_{1:i-1})$, it is mentioned that this distribution can be approximated using Equations 15 and 16
- The paper provides a theory to understand the benefit and detriment of the retrieved documents in RAG. - The paper introduces a method called Tok-RAG to leverage benefit and prevent detriment. - The authors conduct experiments on several datasets.
- The meaning of some notation is not well explained. What is $z^*$? - Some formulations may be incorrect. Is there any problem with Eq. (2)? Given that $p(z|R,x_{1:i-1})$ is a continuous function, how can you get the term corresponding to $p(z|R,x_{1:i-1})$ out from the integral? - Why can you go from (4) to (5)? I believe it should not be "=". Why is $p(R,x_{1:i-1}|z^*)$ a constant? - The estimation of $p_R(x_i|x_{1:i-1})$ seems to be based on heuristics. I am not persuaded that this method can generalize
Taxonomy
Topics: Magnetic properties of thin films · Non-Destructive Testing Techniques · Electric Motor Design and Analysis
Methods: Attention Is All You Need · WordPiece · Linear Warmup With Linear Decay · Weight Decay · Attention Dropout · Linear Layer · Byte Pair Encoding · BART · Adam
