WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection
Guanzhong He, Zhen Yang, Jinxin Liu, Bin Xu, Lei Hou, Juanzi Li

TL;DR
WebSeer introduces a reinforcement learning-based search agent with a self-reflection mechanism, enabling longer tool-use chains and improved accuracy in web-based information retrieval tasks.
Contribution
It presents a novel two-stage training framework with self-reflection, enhancing tool-use depth and accuracy in search agents trained with reinforcement learning.
Findings
Achieves state-of-the-art results on HotpotQA and SimpleQA datasets.
Extends tool-use chains significantly compared to prior methods.
Demonstrates strong generalization to out-of-distribution datasets.
Abstract
Search agents have achieved significant advancements in enabling intelligent information retrieval and decision-making within interactive environments. Although reinforcement learning has been employed to train agentic models capable of more dynamic interactive retrieval, existing methods are limited by shallow tool-use depth and the accumulation of errors over multiple iterative interactions. In this paper, we present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Specifically, we construct a large dataset annotated with reflection patterns and design a two-stage training framework that unifies cold start and reinforcement learning within the self-reflection paradigm for real-world web-based environments, which enables the model to generate longer and more reflective tool-use trajectories. Our approach…
Peer Reviews
Decision: ICLR 2026 Poster
- The paper studies a relevant problem: tool calling (such as web search) is a timely and challenging area of research with a lot of potential for LLMs.
- The presented WebSeer approach is sound and, to the best of my knowledge, novel.
- The experiments ablate two key design choices, 1) self-reflection and 2) the cold-start initialization via SFT, both of which proved beneficial.
- The trained LLMs reach SOTA benchmark scores for their sizes.
- The analysis claims that the method's effectiveness depends heavily on model capacity. Only the 14B model showed consistent performance and stable behavior improvements, while the smaller 3B and 7B models showed degraded performance after SFT and instability during RL. I find this quite puzzling and would like to know whether the authors have further explanations for why this is the case. In particular, some qualitative examples after SFT or during RL would help illustrate the cause.
- The abla
1. The paper tackles the relevant and interesting problem of training tool-use agents. It presents a solid approach based on RL, with a notable mechanism for encouraging self-reflection (though some technical details remain unclear, as noted in the questions below).
2. The empirical performance is strong. The model outperforms baselines on most target datasets and also demonstrates good OOD generalization.
3. The analysis in Sec. 3.3 provides helpful insights on behaviors of tool-use agents trai
1. Limited discussion of related work: There is a significant body of work on using RL for general tool use, not limited to search [1,2,3]. The paper would be stronger if it more comprehensively discussed and positioned itself relative to this broader line of related work.
2. Narrow experimental scope: The evaluation is focused on multi-hop question answering. While I think this is acceptable and does not undermine the paper's core contributions, showing the framework's effectiveness on other ty
- Clear and practical idea: learn long, reflective tool use in a clean Wiki-only setup, then run on the open web. The two-stage SFT → SRRL recipe is simple, and “submit answer” is treated as a tool, which makes stopping explicit.
- Strong results across many benchmarks and tough OOD sets, with consistent gains over strong web-agent baselines.
- Good behavior analysis and ablations that explain why it works: tool-call depth becomes more balanced, removing SRRL hurts, and the SFT data mix matters.
- The open-web transfer story is promising but the ablations are only partial. The paper already shows that limiting RL to a single submission hurts, analyzes tool-use distributions, and studies how the SFT data mix changes behavior. What’s missing are comparisons across training regimes (restricted vs mixed vs open), a sensitivity check for the page-reading/normalization stack, and systematic curves that relate accuracy to chain length under different tool budgets, search-depth caps, and submis
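The reviewers' remark that “submit answer” is treated as a tool, which makes stopping explicit, can be illustrated with a minimal sketch. This is not the authors' implementation: the names `run_episode`, `toy_search`, and `toy_policy`, the tool name `submit_answer`, and the `max_steps` budget are all hypothetical, and the policy is a hard-coded stand-in for a trained model.

```python
from typing import Callable, Dict, List, Optional, Tuple

# One trajectory step: (tool name, tool argument, observation returned by the tool).
Step = Tuple[str, str, str]
# Hypothetical policy interface: maps the trajectory so far to the next (tool, argument).
Policy = Callable[[List[Step]], Tuple[str, str]]


def run_episode(
    policy: Policy,
    tools: Dict[str, Callable[[str], str]],
    max_steps: int = 8,
) -> Tuple[Optional[str], List[Step]]:
    """Run a tool-use loop that stops only when the policy calls the
    (assumed) 'submit_answer' tool or the step budget is exhausted."""
    trajectory: List[Step] = []
    for _ in range(max_steps):
        name, arg = policy(trajectory)
        if name == "submit_answer":
            # Stopping is an explicit action chosen by the policy.
            trajectory.append((name, arg, ""))
            return arg, trajectory
        tool = tools.get(name)
        observation = tool(arg) if tool is not None else "unknown tool"
        trajectory.append((name, arg, observation))
    # Budget exhausted without a submission.
    return None, trajectory


def toy_search(query: str) -> str:
    """Stand-in for a real search tool, backed by a one-entry corpus."""
    corpus = {"capital of France": "Paris is the capital of France."}
    return corpus.get(query, "no results")


def toy_policy(trajectory: List[Step]) -> Tuple[str, str]:
    """Hard-coded stand-in for a trained agent: search once, then submit."""
    if not trajectory:
        return ("search", "capital of France")
    return ("submit_answer", "Paris")


answer, steps = run_episode(toy_policy, {"search": toy_search})
```

Because submission is just another entry in the trajectory, the length of a tool-use chain and the point at which the agent chooses to stop can both be read directly from `steps`.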
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Multimodal Machine Learning Applications
