WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection

Guanzhong He; Zhen Yang; Jinxin Liu; Bin Xu; Lei Hou; Juanzi Li

arXiv:2510.18798·cs.CL·October 22, 2025

TL;DR

WebSeer introduces a reinforcement learning-based search agent with a self-reflection mechanism, enabling longer tool-use chains and improved accuracy in web-based information retrieval tasks.

Contribution

It presents a novel two-stage training framework with self-reflection, enhancing tool-use depth and accuracy in search agents trained with reinforcement learning.

Findings

01

Achieves state-of-the-art results on HotpotQA and SimpleQA datasets.

02

Extends tool-use chains significantly compared to prior methods.

03

Demonstrates strong generalization to out-of-distribution datasets.

Abstract

Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- The paper studies a relevant problem: tool calling (such as web search) is a timely and challenging area of research with a lot of potential for LLMs.
- The presented WebSeer approach is sound and, to the best of my knowledge, novel.
- The experiments ablate two key choices: 1) self-reflection and 2) the cold-start initialization via SFT. Both proved to be beneficial.
- The trained LLMs reach SOTA benchmark scores for their sizes.

Weaknesses

- The analysis claims that the method's effectiveness is heavily dependent on model capacity. Only the 14B model showed consistent performance and stable behavior improvements, while the smaller 3B and 7B models showed degraded performance after SFT and instability during RL. I find this quite puzzling and would like to know whether the authors have further explanations for why this is the case. In particular, some qualitative examples after SFT or during RL would further illustrate the cause.
- The abla

Reviewer 02 · Rating 8 · Confidence 4

Strengths

1. The paper tackles the relevant and interesting problem of training tool-use agents. It presents a solid approach based on RL, with a notable mechanism for encouraging self-reflection (though some technical details remain unclear, as noted in the questions below).
2. The empirical performance is strong. The model outperforms baselines on most target datasets and also demonstrates good OOD generalization.
3. The analysis in Sec. 3.3 provides helpful insights on behaviors of tool-use agents trai

Weaknesses

1. Limited discussion of related work: There is a significant body of work on using RL for general tool use, not limited to search [1,2,3]. The paper would be stronger if it more comprehensively discussed and positioned itself relative to this broader line of related work.
2. Narrow experimental scope: The evaluation is focused on multi-hop question answering. While I think this is acceptable and does not undermine the paper's core contributions, showing the framework's effectiveness on other ty

Reviewer 03 · Rating 6 · Confidence 3

Strengths

- Clear and practical idea: learn long, reflective tool use in a clean Wiki-only setup, then run on the open web. The two-stage SFT → SRRL recipe is simple, and “submit answer” is treated as a tool, which makes stopping explicit.
- Strong results across many benchmarks and tough OOD sets, with consistent gains over strong web-agent baselines.
- Good behavior analysis and ablations that explain why it works: tool-call depth becomes more balanced, removing SRRL hurts, and the SFT data mix matters.
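
The point about treating “submit answer” as a tool can be illustrated with a minimal sketch. This is not the authors' implementation: `run_episode`, `Step`, `toy_policy`, and the tool names `search` and `submit_answer` are hypothetical, and the policy is a scripted stand-in for a trained model. It only shows how routing the final answer through the same tool-call interface gives the episode an explicit stopping action.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Step:
    tool: str          # name of the tool the policy chose
    argument: str      # argument passed to that tool
    observation: str   # what the environment returned


# A policy maps (question, trajectory so far) to a (tool, argument) pair.
Policy = Callable[[str, List[Step]], Tuple[str, str]]


def run_episode(
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run a tool-use loop that ends when the policy calls the
    (hypothetical) submit_answer tool or the step budget runs out."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        tool, argument = policy(question, trajectory)
        if tool == "submit_answer":
            # Stopping is an explicit action chosen by the policy.
            trajectory.append(Step(tool, argument, "submitted"))
            return argument, trajectory
        observation = tools[tool](argument)
        trajectory.append(Step(tool, argument, observation))
    # Budget exhausted without a submission.
    return None, trajectory


def toy_policy(question: str, trajectory: List[Step]) -> Tuple[str, str]:
    """Scripted stand-in for a model: search once, then submit the result."""
    if not trajectory:
        return "search", question
    return "submit_answer", trajectory[-1].observation


# Toy environment with a single canned search result (illustrative only).
toy_tools = {"search": lambda query: "Paris"}

answer, steps = run_episode(toy_policy, toy_tools, "Capital of France?")
```

In this toy run the trajectory contains one `search` step followed by one `submit_answer` step. A policy that never submits simply exhausts `max_steps` and returns no answer, which is the failure mode an explicit stopping tool makes visible.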

Weaknesses

- The open-web transfer story is promising but the ablations are only partial. The paper already shows that limiting RL to a single submission hurts, analyzes tool-use distributions, and studies how the SFT data mix changes behavior. What’s missing are comparisons across training regimes (restricted vs mixed vs open), a sensitivity check for the page-reading/normalization stack, and systematic curves that relate accuracy to chain length under different tool budgets, search-depth caps, and submis

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications