DeepRAG: Thinking to Retrieve Step by Step for Large Language Models
Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, Jie Zhou

TL;DR
DeepRAG introduces a novel framework that models retrieval-augmented reasoning as an MDP, enabling adaptive retrieval and significantly improving answer accuracy in large language models.
Contribution
It proposes DeepRAG, a method that dynamically decomposes queries and decides between retrieval and parametric reasoning, enhancing retrieval efficiency and accuracy.
Findings
Improves answer accuracy by 26.4%.
Enhances retrieval efficiency in reasoning tasks.
Effectively models retrieval as an MDP for adaptive reasoning.
Abstract
Large Language Models (LLMs) have shown remarkable reasoning capabilities, while their practical applications are limited by severe factual hallucinations due to limitations in the timeliness, accuracy, and comprehensiveness of their parametric knowledge. Meanwhile, enhancing retrieval-augmented generation (RAG) with reasoning remains challenging due to ineffective task decomposition and redundant retrieval, which can introduce noise and degrade response quality. In this paper, we propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP), enabling reasonable and adaptive retrieval. By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency and boosts answer accuracy by 26.4%, demonstrating…
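The step-wise decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_episode`, `next_subquery`, `should_retrieve`, `retrieve`, and `answer` are hypothetical, and the toy "model" and "corpus" below are placeholder dictionaries standing in for an LLM and a retriever. The sketch only shows the control flow: generate a subquery, decide whether to retrieve or answer from parametric knowledge, record the step, and repeat.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    subquery: str
    retrieved: bool  # True if external knowledge was fetched for this step
    answer: str


def run_episode(
    question: str,
    next_subquery: Callable[[str, List[Step]], Optional[str]],
    should_retrieve: Callable[[str, List[Step]], bool],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str], List[Step]], str],
    max_steps: int = 5,
) -> List[Step]:
    """Iteratively decompose a question, choosing retrieval per subquery."""
    history: List[Step] = []
    for _ in range(max_steps):
        sub = next_subquery(question, history)
        if sub is None:  # hypothetical termination signal: nothing left to ask
            break
        use_retrieval = should_retrieve(sub, history)
        docs = retrieve(sub) if use_retrieval else []
        history.append(Step(sub, use_retrieval, answer(sub, docs, history)))
    return history


# Toy stand-ins (assumptions for illustration only).
PARAMETRIC = {"sub-a": "answer-a"}  # what the toy "model" already knows
CORPUS = {"sub-b": "answer-b"}      # what only the toy retriever can supply
PLAN = ["sub-a", "sub-b"]           # fixed decomposition of the toy question


def toy_next_subquery(question: str, history: List[Step]) -> Optional[str]:
    return PLAN[len(history)] if len(history) < len(PLAN) else None


def toy_should_retrieve(sub: str, history: List[Step]) -> bool:
    return sub not in PARAMETRIC  # retrieve only when parametric knowledge is missing


def toy_retrieve(sub: str) -> List[str]:
    return [CORPUS[sub]] if sub in CORPUS else []


def toy_answer(sub: str, docs: List[str], history: List[Step]) -> str:
    return docs[0] if docs else PARAMETRIC.get(sub, "unknown")


trajectory = run_episode(
    "toy question", toy_next_subquery, toy_should_retrieve, toy_retrieve, toy_answer
)
```

In this toy run the first subquery is answered from parametric knowledge and only the second triggers retrieval, which is the behavior the abstract calls adaptive retrieval.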
Peer Reviews
Decision: ICLR 2026 Poster
1. Well-Motivated Problem
Existing RAG systems apply retrieval indiscriminately, either over-retrieving (wasting compute) or under-retrieving (missing information). DeepRAG addresses this through atomic decisions: decompose queries into subqueries, then decide retrieval necessity per subquery rather than per original query.
2. Strong Empirical Results
- 25.41% accuracy improvement over baselines (Table 1)
- Better retrieval efficiency: lower average steps than Multi-Step Retrieval (Table 2)
- Good gen
1. Overstated Technical Novelty
The "MDP formulation" is superficial packaging:
- Transitions are deterministic (not stochastic as in typical MDPs)
- No actual policy search, just supervised learning on pre-computed trajectories
- Binary Tree Search is exhaustive enumeration (2^N paths), not a novel algorithm
2. Narrow Evaluation Scope
Only tested on artificially constructed multi-hop QA benchmarks (HotpotQA, 2WikiMultiHop, MuSiQue). Missing evaluations on:
- Single-hop questions (does it over-retrieve?
- Strong empirical gains: a 25.41% accuracy lift is reported across five datasets, with ablation likely showing each stage’s contribution; out-of-distribution PopQA and Freebase-absent WebQuestions stress robustness.
- Uses a fixed-depth priority queue (lowest retrieval count first) and discards unsolvable instances, yielding high-quality imitation data without oracle subqueries.
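The search that the reviewers describe (a binary retrieve-or-skip choice per step, explored lowest retrieval count first, with unsolvable instances discarded) can be sketched as a uniform-cost search. This is an illustration under stated assumptions, not the paper's code: `min_retrieval_path` and the `is_correct` callback are hypothetical names, and a real pipeline would generate and check answers with an LLM instead of a predicate over decision tuples.

```python
import heapq
from typing import Callable, List, Optional, Tuple

Path = Tuple[bool, ...]  # one entry per step: True = retrieve, False = parametric


def min_retrieval_path(
    n_steps: int, is_correct: Callable[[Path], bool]
) -> Optional[Path]:
    """Return a correct decision path with the fewest retrievals, or None."""
    heap: List[Tuple[int, Path]] = [(0, ())]  # (retrieval count, decisions so far)
    while heap:
        count, path = heapq.heappop(heap)
        if len(path) == n_steps:  # fixed depth reached: check the final answer
            if is_correct(path):
                return path
            continue
        for retrieve in (False, True):  # the binary branch at each step
            heapq.heappush(heap, (count + int(retrieve), path + (retrieve,)))
    return None  # unsolvable instance: the caller would discard it


# Toy check (assumption): the answer is correct only if step 2 uses retrieval.
best = min_retrieval_path(3, lambda p: p[1])
unsolvable = min_retrieval_path(3, lambda p: False)
```

Because the queue is ordered by retrieval count, the first correct complete path popped uses as few retrievals as possible. In the worst case the loop still visits all 2^N leaves, which is the cost the review above points out.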
- The method doesn’t seem to offer much innovation. Among the many existing approaches that use reinforcement learning for autonomous multi-turn retrieval, I didn’t find any particularly striking or novel insights.
- All experiments assume a fixed Wikipedia retriever (presumably BM25 or Contriever). What about a comparison with deep research methods, which also use multi-turn retrieval?
* The paper is well written
* When to conduct retrieval is an important topic
* The experiments are extensive
* Compared to Search-R1, the paper mainly differs in using the model's native generation ability to determine when to retrieve, but this has already been well studied and the authors do not provide a new solution.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
