SSRL: Self-Search Reinforcement Learning

Yuchen Fan; Kaiyan Zhang; Heng Zhou; Yuxin Zuo; Yanxu Chen; Yu Fu; Xinwei Long; Xuekai Zhu; Che Jiang; Yuchen Zhang; Li Kang; Gang Chen; Cheng Huang; Zhizhou He; Bingning Wang; Lei Bai; Ning Ding; Bowen Zhou

arXiv:2508.10874·cs.CL·August 15, 2025

3 Reviews

TL;DR

This paper explores how large language models can internally simulate search processes to improve reinforcement learning, reducing reliance on external tools and enabling cost-effective, scalable agent training.

Contribution

It introduces Self-Search RL (SSRL), a novel method that enhances LLMs' internal search capabilities through reward-based training, enabling efficient, self-contained RL agent development.
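As a rough illustration of what reward-based training of this kind might score, the sketch below combines a format check with a rule-based exact-match outcome check. Everything specific in it is an assumption made for illustration and is not taken from the paper: the `<think>`/`<answer>` tag names, the normalization rule, the additive combination, and the `format_weight` value.

```python
import re
import string

# Hypothetical response template: a reasoning block followed by an answer block.
_TEMPLATE = re.compile(r"\s*<think>.+?</think>\s*<answer>.+?</answer>\s*", re.DOTALL)
_ANSWER = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response) else 0.0


def outcome_reward(response: str, gold: str) -> float:
    """Rule-based check: exact match between the extracted answer and the gold answer."""
    match = _ANSWER.search(response)
    if match is None:
        return 0.0
    return 1.0 if normalize(match.group(1)) == normalize(gold) else 0.0


def total_reward(response: str, gold: str, format_weight: float = 0.1) -> float:
    """Assumed combination: outcome reward plus a small format bonus."""
    return outcome_reward(response, gold) + format_weight * format_reward(response)
```

A well-formed, correct response such as `<think>Paris is the capital.</think><answer>Paris.</answer>` scored against the gold answer `paris` receives both components, while a bare `Paris` with no tags receives neither.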

Findings

01·LLMs show strong scaling with inference budget on search tasks, reaching high pass@k scores

02·SSRL reduces dependence on external search engines

03·Models trained with SSRL facilitate robust sim-to-real transfer

Abstract

We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that…
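The abstract reports pass@k under repeated sampling but the excerpt does not say how it is computed. A minimal sketch, assuming the commonly used unbiased estimator 1 - C(n-c, k) / C(n, k) for n samples of which c are correct (the function name `pass_at_k` is chosen here for illustration):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct.

    n: number of samples drawn for the question
    c: number of those samples judged correct
    k: budget being evaluated, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Averaging this value over all benchmark questions gives a dataset-level pass@k; for a fixed n and c > 0 the estimate never decreases as k grows, which is the kind of inference-budget scaling the abstract describes.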

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01·Rating 4·Confidence 4

Strengths

1. The authors performed extensive experiments on the sample size of search agents and argue that performance can be improved by scaling the sample size, even without retrieving external knowledge. Motivated by these observations, the proposed SSRL method eliminates search API costs and improves the performance of search agents by training models to exploit their own internal knowledge.
2. Through extensive experiments, the authors show that models trained with SSRL can both work with i

Weaknesses

1. The authors did not propose novel insights or a new learning framework. Instead, the work seems to be an improvement on the Search-R1 baseline, featuring more refined RL training techniques. The resulting method therefore seems to be an improved "R1-Base / Instruct" baseline without searching.
2. Although the experimental results show that LLM performance can match search agents with SSRL training, the model does not access external information or facts that may not exist within the embedded

Reviewer 02·Rating 6·Confidence 5

Strengths

1. The paper is very well written and easy to understand.
2. The inference time scaling experiments are interesting and insightful.
3. The study of “self-search” reinforcement learning is novel, and it is great to see that the model trained with “self-search” RL can generalize to adopting tools during inference.
4. The authors conduct extensive experiments to demonstrate the effectiveness of the proposed method.

Weaknesses

1. Lack of an ablation study for the format reward. In Section 3.2, the authors propose outcome reward and format reward functions. However, there is no ablation study to verify the effectiveness of the format reward.
2. Is the method only suitable for a specific type of LLM? The main results in Table 1 are based on Llama models. It is questionable whether SSRL can still outperform other methods on other types of LLMs, such as Qwen. From Figure 5, it seems that the performance on Qwe

Reviewer 03·Rating 4·Confidence 4

Strengths

- The idea of leveraging the intrinsic search ability of LLMs to reduce reliance on external search engines during RL training is interesting and might be novel in this particular setting. I recognize that it addresses a practical challenge in training LLMs for agentic tasks.
- The empirical evaluation is comprehensive, covering multiple LLM families and benchmarks. The results demonstrate the effectiveness of SSRL in improving search accuracy and reducing external search costs. Also, the Append

Weaknesses

- My biggest concern is about the clarity of the proposed approach. While the high-level idea of Self-Search and SSRL is understandable, some details are unclear to me. I believe the paper would benefit from more polishing to improve clarity.
- Starting from Figure 1 (left), at first glance, it was not clear what the dotted box represented for the full-sim search section.
- What are the instructions/prompts used? I'm assuming they are different when studying the Self-Search ability (Section 2
