How to Train Your Deep Research Agent? Prompt, Reward, and Policy Optimization in Search-R1

Yinuo Xu; Shuo Lu; Jianjie Cheng; Meng Wang; Qianlong Xie; Xingxing Wang; Ran He; Jian Liang

arXiv:2602.19526 · cs.CL · February 24, 2026

TL;DR

This paper systematically studies reinforcement learning strategies in Deep Research agents, revealing key insights on prompt templates, reward functions, and optimization methods, and introduces an improved baseline called Search-R1++.

Contribution

It provides a comprehensive analysis of RL components in Deep Research agents and proposes a new baseline that enhances performance based on these insights.

Findings

01

Fast Thinking template improves stability and performance.

02

F1 reward underperforms EM because of answer avoidance; action-level penalties mitigate this.

03

REINFORCE outperforms PPO in efficiency and stability.

Abstract

Deep Research agents tackle knowledge-intensive tasks through multi-round retrieval and decision-oriented generation. While reinforcement learning (RL) has been shown to improve performance in this paradigm, its contributions remain underexplored. To fully understand the role of RL, we conduct a systematic study along three decoupled dimensions: prompt template, reward function, and policy optimization. Our study reveals that: 1) the Fast Thinking template yields greater stability and better performance than the Slow Thinking template used in prior work; 2) the F1-based reward underperforms the EM reward because of training collapse driven by answer avoidance; incorporating action-level penalties mitigates this, and the penalized F1 reward ultimately surpasses EM; 3) REINFORCE outperforms PPO while requiring fewer search actions, whereas GRPO shows the poorest stability among policy optimization methods. Building…
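To make the reward comparison in the abstract concrete, here is a minimal sketch of the two outcome rewards, assuming the common SQuAD-style definitions of exact match (EM) and token-level F1. The abstract does not specify the form of the action-level penalties, so `penalized_f1_reward` and its parameters (`no_answer_penalty`, `no_search_penalty`) are hypothetical placeholders chosen only for illustration, not the authors' formulation.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation and English articles, split on whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def em_reward(prediction: str, gold: str) -> float:
    """Exact match: 1.0 only if the normalized token sequences are identical."""
    return float(normalize(prediction) == normalize(gold))


def f1_reward(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of token precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def penalized_f1_reward(
    prediction: str,
    gold: str,
    num_searches: int,
    answered: bool,
    no_answer_penalty: float = 0.5,  # hypothetical value
    no_search_penalty: float = 0.2,  # hypothetical value
) -> float:
    """F1 with illustrative action-level penalties (assumed, not from the paper).

    A rollout that never emits an answer receives a negative reward, so
    avoiding an answer is strictly worse than answering incorrectly.
    """
    if not answered:
        return -no_answer_penalty
    reward = f1_reward(prediction, gold)
    if num_searches == 0:
        reward -= no_search_penalty
    return reward
```

Under these assumptions, a prediction of "Paris France" against the gold answer "Paris" scores 0 under EM but about 0.67 under F1, which illustrates how F1 gives partial credit where EM gives none.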

Taxonomy

Topics: Domain Adaptation and Few-Shot Learning · Topic Modeling · Reinforcement Learning in Robotics