When Iterative RAG Beats Ideal Evidence: A Diagnostic Study in Scientific Multi-hop Question Answering

Mahdi Astaraki; Mohammad Arshi Saloot; Ali Shiraee Kasmaee; Hamidreza Mahyar; Soheila Samiee

arXiv:2601.19827·cs.CL·May 5, 2026

TL;DR

This study shows that iterative retrieval-reasoning loops in scientific multi-hop question answering often outperform static RAG, especially for non-reasoning models, by reducing late-hop failures and context overload and by enabling dynamic correction.

Contribution

The paper provides the first controlled diagnostic analysis showing that iterative RAG can surpass an idealized static evidence setting (Gold Context) in scientific multi-hop QA, along with practical deployment insights.

Findings

01. Iterative RAG outperforms Gold Context by up to 25.6 percentage points.

02. Staged retrieval reduces late-hop failures and context overload.

03. Remaining challenges include incomplete hop coverage and early stopping calibration.

Abstract

Retrieval-Augmented Generation (RAG) extends large language models (LLMs) beyond parametric knowledge, yet it is unclear when iterative retrieval-reasoning loops meaningfully outperform static RAG, particularly in scientific domains with multi-hop reasoning, sparse domain knowledge, and heterogeneous evidence. We provide the first controlled, mechanism-level diagnostic study of whether synchronized iterative retrieval and reasoning can surpass an idealized static upper bound (Gold Context) RAG. We benchmark eleven state-of-the-art LLMs under three regimes: (i) No Context, measuring reliance on parametric memory; (ii) Gold Context, where all oracle evidence is supplied at once; and (iii) Iterative RAG, a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. Using the chemistry-focused ChemKGMultiHopQA dataset, we isolate questions…
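The abstract describes the Iterative RAG regime as a training-free controller that alternates retrieval, hypothesis refinement, and evidence-aware stopping. The sketch below illustrates that general loop shape only; it is not the authors' implementation. Every name in it (`iterative_rag`, `LoopState`, `toy_retrieve`, `toy_reason`, `TOY_CORPUS`) is hypothetical, the corpus facts are made up for the example, and the keyword retriever and rule-based reasoner stand in for a real retriever and an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

# (hypothesis, next_query, done) as returned by the reasoning step.
ReasonResult = Tuple[Optional[str], Optional[str], bool]


@dataclass
class LoopState:
    question: str
    evidence: List[str] = field(default_factory=list)
    hypothesis: Optional[str] = None
    steps: int = 0


def iterative_rag(
    question: str,
    retrieve: Callable[[str], List[str]],
    reason: Callable[[str, List[str]], ReasonResult],
    max_steps: int = 4,
) -> LoopState:
    """Alternate retrieval and reasoning until the reasoner signals it can stop."""
    state = LoopState(question)
    query = question
    for _ in range(max_steps):
        state.steps += 1
        # Retrieval: add only passages that are not already in the evidence pool.
        for passage in retrieve(query):
            if passage not in state.evidence:
                state.evidence.append(passage)
        # Hypothesis refinement: the reasoner sees all evidence gathered so far.
        hypothesis, next_query, done = reason(question, state.evidence)
        state.hypothesis = hypothesis
        # Evidence-aware stopping: halt when the reasoner is done or has no follow-up.
        if done or next_query is None:
            break
        query = next_query
    return state


# Toy two-hop setup (made-up facts, assumed purely for illustration).
TOY_CORPUS = {
    "compound x": "Compound X is synthesized from reagent Y.",
    "reagent y": "Reagent Y boils at 56 C.",
}


def toy_retrieve(query: str) -> List[str]:
    """Keyword lookup standing in for a real retriever."""
    q = query.lower()
    return [text for key, text in TOY_CORPUS.items() if key in q]


def toy_reason(question: str, evidence: List[str]) -> ReasonResult:
    """Rule-based stand-in for an LLM reasoning step."""
    for passage in evidence:
        if "boils at" in passage:
            return passage.split("boils at")[1].strip(" ."), None, True
    if any("reagent Y" in p for p in evidence):
        # First hop resolved; ask a follow-up query for the second hop.
        return None, "reagent Y boiling point", False
    return None, None, True


QUESTION = "What is the boiling point of the reagent used to synthesize compound X?"
result = iterative_rag(QUESTION, toy_retrieve, toy_reason)
print(result.hypothesis, result.steps, len(result.evidence))
```

In this toy run the first retrieval surfaces only the first-hop passage, and the second-hop query is formed only after the reasoner has read it. A Gold Context setup would instead hand both passages to the model in a single prompt.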

Code & Models

Datasets

molmohsen/awesome-ai-agent-papers
dataset· 41 dl
