Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning

Wenlin Zhang; Xiangyang Li; Kuicai Dong; Yichao Wang; Pengyue Jia; Xiaopeng Li; Yingyi Zhang; Derong Xu; Zhaocheng Du; Huifeng Guo; Ruiming Tang; Xiangyu Zhao

arXiv:2505.14069 · cs.IR · December 9, 2025

TL;DR

This paper introduces ReasonRAG, a process-supervised reinforcement learning framework for agentic RAG systems that uses process-level rewards to improve training efficiency and performance, outperforming prior methods on multiple benchmarks.

Contribution

The paper proposes a novel process-level reward approach and constructs the RAG-ProGuide dataset, which together improve agentic RAG training while requiring fewer training instances.

Findings

01

ReasonRAG outperforms Search-R1 and traditional RAG on five benchmarks.

02

Achieves superior performance with only 5k training instances.

03

Reduces training data requirements significantly compared to prior methods.

Abstract

Retrieval-augmented generation (RAG) enhances the text generation capabilities of large language models (LLMs) by integrating external knowledge and up-to-date information. However, traditional RAG systems are limited by static workflows and lack the adaptability required for multistep reasoning and complex task management. To address these limitations, agentic RAG systems (e.g., DeepResearch) have been proposed, enabling dynamic retrieval strategies, iterative context refinement, and adaptive workflows for handling complex search queries beyond the capabilities of conventional RAG. Recent advances, such as Search-R1, have demonstrated promising gains using outcome-based reinforcement learning, where the correctness of the final answer serves as the reward signal. Nevertheless, such outcome-supervised agentic RAG methods face challenges including low exploration efficiency, gradient…
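To make the outcome-based setup described in the abstract concrete, the following is a minimal sketch of a reward that depends only on the correctness of the final answer. Normalized exact match is used here as an assumed correctness check, and the names `normalize`, `outcome_reward`, and `trajectory_rewards` are hypothetical; none of this is taken from the Search-R1 or ReasonRAG code.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def outcome_reward(predicted: str, gold: str) -> float:
    """Assumed correctness check: 1.0 on normalized exact match, else 0.0."""
    return 1.0 if normalize(predicted) == normalize(gold) else 0.0


def trajectory_rewards(num_steps: int, predicted: str, gold: str) -> list[float]:
    """Outcome supervision: only the final step of a trajectory is rewarded.

    Every intermediate step (query generation, retrieval, evidence reading)
    receives 0.0, regardless of how useful it was.
    """
    rewards = [0.0] * num_steps
    if num_steps > 0:
        rewards[-1] = outcome_reward(predicted, gold)
    return rewards


if __name__ == "__main__":
    # A three-step trajectory whose final answer matches the reference.
    print(trajectory_rewards(3, "The Eiffel Tower.", "eiffel tower"))
```

In this sketch the intermediate steps carry no direct signal of their own, which is the gap a process-level reward is meant to fill by scoring steps individually.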

Code & Models

Repositories

wlzhang2020/reasonrag
PyTorch · Official

Videos

Process vs. Outcome Reward: Which is Better for Agentic RAG Reinforcement Learning · SlidesLive

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Information Retrieval and Search Behavior

Methods: Attention Is All You Need · Linear Warmup With Linear Decay · Softmax · Attention Dropout · WordPiece · Linear Layer · Residual Connection · Byte Pair Encoding · Weight Decay