REX-RAG: Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation
Wentao Jiang, Xiang Feng, Zengmao Wang, Yong Luo, Pingbo Xu, Zhe Chen, Bo Du, Jing Zhang

TL;DR
REX-RAG introduces a novel reinforcement learning framework for retrieval-augmented generation that enhances reasoning exploration and policy correction, leading to improved performance on question-answering benchmarks.
Contribution
The paper proposes REX-RAG, a new framework combining mixed sampling and importance sampling-based policy correction to improve reasoning exploration in RL for LLMs.
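To make the idea of importance sampling-based correction under mixed sampling concrete, here is a minimal toy sketch. It is not the paper's algorithm: the single-step "trajectories", the probabilities, the mixing weight `alpha`, and every function name are assumptions made up for illustration. It shows only the general principle that samples drawn from a mixture of the target policy and an exploratory probe policy can be reweighted by the ratio of the target probability to the mixture probability, so the estimate stays unbiased with respect to the target policy.

```python
import random


def mixture_prob(p_policy, p_probe, alpha):
    """Probability of an outcome under the mixed sampling distribution.

    alpha is the (hypothetical) fraction of samples drawn from the probe policy.
    """
    return (1.0 - alpha) * p_policy + alpha * p_probe


def importance_weight(p_policy, p_probe, alpha):
    """Ratio of target-policy probability to mixed-sampling probability."""
    return p_policy / mixture_prob(p_policy, p_probe, alpha)


def estimate_on_policy_reward(policy, probe, reward, alpha, n, seed=0):
    """Estimate the target policy's expected reward from mixed samples.

    policy, probe: dicts mapping outcome -> probability (toy stand-ins for
    trajectory likelihoods). reward: dict mapping outcome -> scalar reward.
    """
    rng = random.Random(seed)
    outcomes = list(policy)
    total = 0.0
    for _ in range(n):
        # Mixed sampling: choose which distribution generates this sample.
        source = probe if rng.random() < alpha else policy
        outcome = rng.choices(outcomes, weights=[source[o] for o in outcomes])[0]
        # Policy correction: reweight so the expectation matches the target policy.
        weight = importance_weight(policy[outcome], probe[outcome], alpha)
        total += weight * reward[outcome]
    return total / n


# Toy numbers (assumed): the policy mostly hits a dead end, the probe explores.
policy = {"dead_end": 0.9, "correct": 0.1}
probe = {"dead_end": 0.2, "correct": 0.8}
reward = {"dead_end": 0.0, "correct": 1.0}

estimate = estimate_on_policy_reward(policy, probe, reward, alpha=0.5, n=20000)
print(f"corrected estimate of on-policy reward: {estimate:.3f}")
```

In this toy setup the true on-policy expected reward is 0.9 × 0 + 0.1 × 1 = 0.1, so the corrected estimate should land close to that value even though roughly 45% of the mixed samples are correct. An unweighted average over the same samples would overstate the policy's reward.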
Findings
Achieves 5.1% performance gain on Qwen2.5-3B
Achieves 3.6% performance gain on Qwen2.5-7B
Demonstrates effective exploration and policy correction in reasoning tasks
Abstract
Reinforcement learning (RL) is emerging as a powerful paradigm for enabling large language models (LLMs) to perform complex reasoning tasks. Recent advances indicate that integrating RL with retrieval-augmented generation (RAG) allows LLMs to dynamically incorporate external knowledge, leading to more informed and robust decision making. However, we identify a critical challenge during policy-driven trajectory sampling: LLMs are frequently trapped in unproductive reasoning paths, which we refer to as "dead ends", committing to overconfident yet incorrect conclusions. This severely hampers exploration and undermines effective policy optimization. To address this challenge, we propose REX-RAG (Reasoning Exploration with Policy Correction in Retrieval-Augmented Generation), a novel framework that explores alternative reasoning paths while maintaining rigorous policy learning through…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The authors study an important problem, dead ends in RL-based RAG settings, where the policy is often unable to generate correct reasoning paths for complex input queries.
2. The authors introduce effective techniques that improve the training dynamics on the rollouts and on the additional corrective continuations produced by the probe policy.
3. REX-RAG shows strong performance on open-domain QA datasets, suggesting the model learns improved reasoning patterns for multi-turn search LLMs.
1. The entire exploration and correction mechanism is training-only, as it relies on ground-truth labels (rather than a learned verifier) to identify dead ends and trigger exploration. At inference time these mechanisms are therefore deactivated, and the model cannot correct itself if it heads down an incorrect reasoning path.
2. The exploration and additional sampling introduce extra computational overhead during training, which may make the comparison unfair to baselines such as Search-R1.
The proposed method seems sound and effective based on the reported results, which makes it convincing that this approach is useful for training policies that better interact with search engines (or tools). Additionally, the experiment setup is very comprehensive and is accompanied by a good set of ablations that clarify the effectiveness of each component in the system.
The proposed method is significantly more expensive than the baselines, specifically the most similar baseline, Search-R1. It is not clear to me whether the improvements observed here come from the increased number of samples drawn during training or from the sampling strategy itself. We know that increasing the number of rollouts during training can significantly increase the compute budget and improve performance. I am curious to see if this method still performs better than Search-R1 if it uses the same num
- The paper presents an interesting and well-written idea.
- It addresses a compelling question in agentic reinforcement learning: how can we enable LLMs to learn to reflect, or more generally, when introducing an external policy (e.g., for reflection), how can we ensure the policy we want to learn remains on-policy? The paper provides reasonable and well-motivated solutions, including (1) filtering and (2) distribution realignment.
- The experimental evaluation is comprehensive and thoughtful
- The appendix pages are missing, which makes it difficult to fully understand several key parts of the work.
- The training process is somewhat complicated and not clearly explained, leading to confusion. For example, it is unclear what the complete training pipeline looks like: whether the two policies are trained jointly or sequentially, and whether they share parameters.
- The ambiguity in describing the training procedure, along with the complexity of the overall training pipeline, make
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Natural Language Processing Techniques
