Inference Scaling for Bridging Retrieval and Augmented Generation
Youngwon Lee, Seung-won Hwang, Daniel Campos, Filip Graliński, Zhewei Yao, Yuxiong He

TL;DR
This paper introduces Mixture-of-Intervention (MOI), an inference scaling method that mitigates generator bias in retrieval-augmented generation by aggregating multiple inference passes over permuted orders of the retrieved contexts, improving performance on multiple benchmarks.
Contribution
The paper proposes MOI, an inference-time technique that reduces generator bias in RAG by re-ranking passages according to their debiased utility, and that uses the retriever's prior knowledge to lower computational cost.
Findings
Improves ROUGE-L on MS MARCO by ~7 points.
Enhances EM on HotpotQA by ~7 points.
Reduces computational cost by using the retriever's prior knowledge to limit the number of permutations considered and to lower the cost per LLM call.
Abstract
Retrieval-augmented generation (RAG) has emerged as a popular approach to steering the output of a large language model (LLM) by incorporating retrieved contexts as inputs. However, existing work has observed a generator bias, whereby improving the retrieval results may negatively affect the outcome. In this work, we show that such bias can be mitigated through inference scaling, by aggregating inference calls over permuted orders of the retrieved contexts. The proposed Mixture-of-Intervention (MOI) explicitly models the debiased utility of each passage with multiple forward passes to construct a new ranking. We also show that MOI can leverage the retriever's prior knowledge to reduce the computational cost by minimizing the number of permutations considered and lowering the cost per LLM call. We showcase the effectiveness of MOI on diverse RAG tasks, improving ROUGE-L on MS MARCO and EM on…
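The core idea in the abstract, scoring each passage under several context orders and aggregating the results into a new ranking, can be illustrated with a small sketch. This is not the authors' implementation: the `score_fn` interface, the use of cyclic rotations as the set of permutations, and plain averaging as the aggregation rule are all assumptions made for illustration, and the toy scorer stands in for what would be an LLM call in practice.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical interface (an assumption, not from the paper): given passages in
# prompt order, return one utility score per passage, in that same order.
ScoreFn = Callable[[Sequence[str]], Sequence[float]]


def rotations(n: int) -> List[List[int]]:
    """All n cyclic rotations of the indices 0..n-1.

    Each index lands in each position exactly once across the rotations,
    so a purely additive position bonus averages out equally for every passage.
    """
    return [[(start + offset) % n for offset in range(n)] for start in range(n)]


def rerank_by_averaged_utility(
    passages: Sequence[str],
    score_fn: ScoreFn,
    orders: Optional[List[List[int]]] = None,
) -> List[int]:
    """Return passage indices sorted by mean score over the given orders."""
    n = len(passages)
    if orders is None:
        orders = rotations(n)  # assumed default; fewer orders means fewer calls
    totals = [0.0] * n
    for order in orders:
        # One scoring pass per order (one LLM call in a real system).
        scores = score_fn([passages[i] for i in order])
        for idx, score in zip(order, scores):
            totals[idx] += score
    means = [total / len(orders) for total in totals]
    return sorted(range(n), key=lambda i: means[i], reverse=True)


# Toy scorer with an additive bonus for whichever passage is placed first.
TRUE_UTILITY = {"a": 0.2, "b": 0.9, "c": 0.5}
POSITION_BONUS = [1.0, 0.0, 0.0]


def toy_score_fn(ordered: Sequence[str]) -> List[float]:
    return [TRUE_UTILITY[p] + POSITION_BONUS[pos] for pos, p in enumerate(ordered)]


docs = ["a", "b", "c"]
single_pass = rerank_by_averaged_utility(docs, toy_score_fn, orders=[[0, 1, 2]])
aggregated = rerank_by_averaged_utility(docs, toy_score_fn)
print(single_pass)  # [0, 1, 2]: the first-position bonus puts "a" on top
print(aggregated)   # [1, 2, 0]: averaging recovers the order b, c, a
```

In this toy setting a single pass ranks the least useful passage first because of its position, while averaging over rotations cancels the bonus. Passing a shorter `orders` list trades accuracy for fewer scoring calls, which is the kind of cost reduction the abstract attributes to using the retriever's prior knowledge; how MOI actually selects permutations is not specified here.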
Taxonomy
Topics: Topic Modeling · Speech Recognition and Synthesis · Domain Adaptation and Few-Shot Learning
Methods: Linear Layer · Attention Is All You Need · Dense Connections · Byte Pair Encoding · Multi-Head Attention · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay
