Search Self-play: Pushing the Frontier of Agent Capability without Supervision
Hongliang Lu, Yuhang Wen, Pengyu Cheng, Ruijin Ding, Jiaqi Guo, Haotian Xu, Chutian Wang, Haonan Chen, Xiaoxi Jiang, Guanjun Jiang

TL;DR
This paper introduces search self-play (SSP), a self-supervised training method for deep search agents. By co-evolving a task proposer and a problem solver, SSP improves agent performance across benchmarks without human supervision.
Contribution
The paper presents a new self-play framework for search agents that generates and solves tasks of increasing difficulty, improving agent capabilities without human-labeled data.
Findings
SSP significantly improves search agent performance on various benchmarks.
The method works effectively in both from-scratch and continuous RL training.
SSP reduces reliance on human-crafted task queries and ground-truth answers.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding ground-truth answers to provide accurate rewards, which requires significant human effort and hinders the scaling of RL processes, especially in agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The…
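The proposer/solver interplay described in the abstract can be pictured with a small toy example. This is only a sketch under stated assumptions, not the authors' implementation: the names `ProposedTask`, `solver_rewards`, and `proposer_reward` are hypothetical, exact-match scoring stands in for whatever answer verification the paper uses, and the "one minus mean solver accuracy" proposer reward is an assumed zero-sum-style choice.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ProposedTask:
    """A task emitted by the proposer (hypothetical container)."""
    question: str
    answer: str  # ground-truth answer attached by the proposer


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing answers.
    return " ".join(text.lower().split())


def solver_rewards(task: ProposedTask, solver_answers: list[str]) -> list[float]:
    # Verifiable 0/1 reward per solver attempt (assumed: exact match).
    gold = normalize(task.answer)
    return [1.0 if normalize(a) == gold else 0.0 for a in solver_answers]


def proposer_reward(rewards: list[float]) -> float:
    # Assumed zero-sum-style signal: the proposer gains when the solver fails.
    return 1.0 - mean(rewards)


task = ProposedTask(
    question="Which city hosted the 1900 Summer Olympics?",
    answer="Paris",
)
attempts = ["paris", "London", " Paris ", "Athens"]

rewards = solver_rewards(task, attempts)
print(rewards)                   # [1.0, 0.0, 1.0, 0.0]
print(proposer_reward(rewards))  # 0.5
```

In this toy setup a question that every attempt solves earns the proposer nothing, while a question no attempt solves earns the maximum, which is why some validity constraint on proposed questions is needed to keep the proposer from emitting unanswerable tasks.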
Peer Reviews
Decision: ICLR 2026 Poster
- The paper tackles the very relevant and challenging problem of improving LLMs for search and information retrieval.
- The zero-sum game formulation is very interesting, but it is also easily exploitable by both agents, so I would have expected it to produce degenerate solutions. Instead, the proposed constrained approach and overall meta-algorithm seem quite robust and able to produce stable improvements, as measured by the improved downstream performance during training.
- The experimen
- I don’t understand why the proposer and solver are fine-tuned with different update rules, REINFORCE and GRPO respectively. Couldn’t the same algorithm, say GRPO, be applied to both? This was not clear to me.
- Are both the proposer and solver updated at each step? I would expect updating the proposer less frequently to give more stable training dynamics. I would be curious to know the authors’ view and experience on this. I also think this would help in understanding the training dynamics.
- It would be
1. Novel Problem Formulation: The paper addresses a fundamental challenge in agentic RL training, data scarcity, through an elegant self-play mechanism that grounds question generation in external search rather than relying solely on the model's internal knowledge.
2. Thorough Ablation Studies: The paper provides valuable insights through ablations on training schemes (co-evolution vs fixed-opponent), batch completion strategies, RAG verification configurations, and reward design, with detailed
1. Limited Evaluation Setting: All experiments use a local E5 retriever with a Wikipedia 2018 corpus, which is significantly more constrained than actual web search scenarios. The paper doesn't evaluate whether SSP-trained agents generalize to real web search or more recent knowledge bases. The recently released BrowseComp dataset is a good candidate for evaluation.
2. Lack of Question Quality Analysis: While the paper demonstrates that the proposer generated increasingly difficult questions
1. This work is novel in its application of self-play to agentic search.
2. The paper provides precise definitions of constraints and rewards, fully discloses prompts and hyperparameters, and includes training curves that illustrate the co-evolution of proposer and solver.
3. By reducing reliance on annotated agentic data, SSP achieves consistent, substantial improvements across different model architectures and scales, paving a practical path toward scalable, unsupervised training for agentic
1. Please discuss training stability under the adversarial/cooperative setup. Does the min–max dynamic cause frequent collapses or mode drift? How easy is convergence in practice? Please provide stability evidence (multiple seeds, valid-question rate over time, reward variance) and, if available, theoretical or empirical guarantees that prevent reward hacking and proposer entropy explosions.
2. Please report end-to-end compute and cost: total wall-clock, GPU hours, tokens processed, search call
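One review above asks why the proposer is updated with REINFORCE while the solver uses GRPO. As a rough illustration of how the two signals differ, the sketch below compares a REINFORCE-style advantage (reward minus an optional baseline) with a GRPO-style advantage (reward standardized within a group of rollouts for the same prompt). The function names are hypothetical, and the use of the population standard deviation and the `eps` value are assumptions; the paper's exact update rules are not reproduced here.

```python
from statistics import mean, pstdev


def reinforce_advantages(rewards: list[float], baseline: float = 0.0) -> list[float]:
    # REINFORCE-style signal: raw reward minus an optional fixed baseline.
    return [r - baseline for r in rewards]


def grpo_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # GRPO-style signal: standardize rewards within one group of rollouts
    # sampled for the same prompt (population std is an assumption here).
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


group = [1.0, 0.0, 0.0, 1.0]  # 0/1 rewards for four rollouts of one question
print(reinforce_advantages(group, baseline=0.5))
print(grpo_advantages(group))
```

The group-relative form needs several rollouts per prompt to compute a mean and spread, and it yields zero signal when all rollouts in a group receive the same reward.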
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Artificial Intelligence in Games
