TL;DR
FrugalRAG introduces a two-stage finetuning framework (supervised finetuning followed by reinforcement learning) that adaptively reduces retrieval steps in multi-hop question answering, maintaining high accuracy at significantly lower retrieval cost.
Contribution
It presents a novel RL-based method that prunes search depth based on question difficulty, improving efficiency without sacrificing performance in RAG tasks.
Findings
Achieves state-of-the-art efficiency-accuracy tradeoffs on HotPotQA.
Reduces retrieval cost by nearly half compared to prior methods.
Generalizes zero-shot to challenging benchmarks like BrowseCompPlus.
Abstract
Reinforcement learning (RL) based on the final answer's reward has driven recent progress in small language models (SLMs) on reasoning-heavy tasks such as math and code. However, applying the same techniques to retrieval-augmented generation (RAG) benchmarks like multi-hop QA has yielded limited gains, often trailing supervised or prompting-only baselines. Instead, we argue that a viable path for RL in multi-hop QA is to use test-time scaling judiciously to optimize both final answer accuracy and efficiency in reaching that answer. We propose FrugalRAG, a two-stage finetuning framework that adaptively reduces the number of retrieval steps based on a question's difficulty. First, we train an SLM with supervised finetuning on a full-exploration policy that generates broad sub-queries. Then, we apply RL to adaptively prune search depth based on question difficulty, directly rewarding…
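The adaptive stopping idea described in the abstract can be sketched as a retrieval loop that asks a stopping policy, before each search, whether the evidence gathered so far is enough. This is a minimal illustration only, not the authors' implementation: the names `adaptive_retrieve`, `propose_query`, `retrieve`, `should_stop`, and `budget` are hypothetical, and the toy callables stand in for the finetuned SLM and the retriever.

```python
from typing import Callable, List, Tuple


def adaptive_retrieve(
    question: str,
    propose_query: Callable[[str, List[str]], str],  # hypothetical: SLM writes the next sub-query
    retrieve: Callable[[str], List[str]],            # hypothetical: retriever returns documents
    should_stop: Callable[[str, List[str]], bool],   # hypothetical: learned stopping decision
    budget: int,                                     # maximum number of searches allowed
) -> Tuple[List[str], int]:
    """Collect evidence until the policy chooses to stop or the budget is spent."""
    evidence: List[str] = []
    searches = 0
    while searches < budget:
        # Ask the stopping policy before paying for another search.
        if should_stop(question, evidence):
            break
        query = propose_query(question, evidence)
        evidence.extend(retrieve(query))
        searches += 1
    return evidence, searches


# Toy usage: a three-document corpus and a policy that stops after two documents.
corpus = {"q0": ["doc A"], "q1": ["doc B"], "q2": ["doc C"]}
evidence, n_searches = adaptive_retrieve(
    "toy question",
    propose_query=lambda q, ev: f"q{len(ev)}",
    retrieve=lambda query: corpus.get(query, []),
    should_stop=lambda q, ev: len(ev) >= 2,
    budget=5,
)
print(evidence, n_searches)
```

In this sketch an easy question would trigger `should_stop` early and a hard one would run closer to the budget, which is the behavior the abstract attributes to the RL stage.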
Peer Reviews
Decision·ICLR 2026 Poster
1. Simple yet effective idea: learn when to stop searching rather than always searching more. 2. Strong empirical results with small data (1k), especially on HotPotQA under standardized E5/FlashRAG settings. 3. Adaptive compute validated quantitatively and qualitatively (fewer searches on easy questions, more on hard ones). 4. Generalization to BrowseCompPlus with more hops than seen in training. 5. Can pair with different retrievers and answerers, as demonstrated with CoRAG.
I do not identify any major weaknesses in this work, but I have a few questions: 1. Reward design depends on oracle-like signals (evidence recall) and somewhat opaque hyperparameters. 2. No learned or heuristic stopping baselines are included, so it is unclear how much RL beats simpler termination rules. 3. Efficiency claims would benefit from runtime/FLOPs measurements and ablations on budget B and sample count v. 4. Practicality depends on the availability of ground-truth evidence for training.
S1. Clear problem focus. The paper presents a well-defined motivation by targeting the inefficiency problem in multi-hop RAG systems, where existing RL/RAG agents often over-retrieve and waste searches. FrugalRAG directly addresses this issue by introducing an adaptive stopping mechanism that halts retrieval once sufficient evidence is gathered, ensuring both cost efficiency and effectiveness. S2. Logical two-stage design. The technical structure is intuitive and easy to follow. Stage 1 uses su
W1. Concerns about technical design. (1) Line 251 defines “$\Delta$ is the normalized difference $(h_\text{term} - h) / B$”, but only $h_\text{term}$, $h^*$, and $B$ are defined before. Should it be “$\Delta = (h_\text{term} - h^*) / B$”? This matches the idea that the penalty depends on how far you are from the optimal stopping hop. (2) Lines 243–249 define a piecewise reward $R$ for late stop, perfect stop, and early stop. The “early stop” case is triggered by $c < \tau$ (low answer quality),
1. Separating exploration (coverage) from stopping (efficiency) is a clean, actionable design; the reward explicitly ties termination to an “optimal” hop length and format compliance. 2. Tables show fewer searches at comparable or better MBE/recall versus strong prompting and SFT baselines, and the “tradeoff” metric highlights practical benefits. 3. The method increases search steps with question difficulty and shows some cross-dataset transfer; the modularity across retrievers/generators is a
1. Motivation around “less data” is under-argued: The paper asserts success with ~1k examples, but the core motivation, why low-label regimes are critical for RAG, feels underspecified. In many RAG settings, labeled QA pairs or weak labels are not rare (e.g., synthetic or distantly-supervised pipelines), and the paper does not clarify whether performance keeps improving with more data (scaling behavior is central to RAG systems). A scaling curve (1k→10k→100k) would make this claim concrete. 2.
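As a worked illustration of the normalized difference raised in W1 above, the snippet below computes the gap between the hop at which the policy terminates and the optimal hop, divided by the budget. It assumes the reviewer's reading that the reference hop is the optimal one; the function name `normalized_hop_gap` and the example numbers are hypothetical, and the snippet does not reproduce the paper's full piecewise reward.

```python
def normalized_hop_gap(h_term: int, h_star: int, budget: int) -> float:
    """Return (h_term - h_star) / budget, the signed distance from the optimal stopping hop.

    h_term: hop at which the policy stopped (hypothetical example values below)
    h_star: optimal stopping hop
    budget: maximum number of hops B, used as the normalizer
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    return (h_term - h_star) / budget


# With a budget of 10 hops and an optimal stop at hop 3:
gap_over = normalized_hop_gap(h_term=5, h_star=3, budget=10)   # stopped two hops past h_star
gap_exact = normalized_hop_gap(h_term=3, h_star=3, budget=10)  # stopped exactly at h_star
gap_under = normalized_hop_gap(h_term=1, h_star=3, budget=10)  # stopped two hops before h_star
print(gap_over, gap_exact, gap_under)
```

A positive value means more searches than the optimal hop count, zero means the policy stopped at the optimal hop, and a negative value means it stopped sooner. How the paper maps these values to reward cases is not shown in the excerpt and is not assumed here.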
