TL;DR
This paper explores how large language models can internally simulate search processes to improve reinforcement learning, reducing reliance on external tools and enabling cost-effective, scalable agent training.
Contribution
It introduces Self-Search RL (SSRL), a novel method that enhances LLMs' internal search capabilities through reward-based training, enabling efficient, self-contained RL agent development.
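As a rough illustration of what reward-based training with format and rule-based signals could look like, the sketch below scores a single model response. Everything specific here is an assumption made for illustration and is not taken from the paper: the `<think>` and `<answer>` tag names, the exact-match normalization, and the `format_weight` value are all hypothetical.

```python
import re
import string

# Hypothetical tag convention; the paper's actual response template may differ.
THINK_RE = re.compile(r"<think>.*?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def format_reward(response: str) -> float:
    """1.0 if the response has a reasoning block and exactly one answer block."""
    has_think = THINK_RE.search(response) is not None
    answers = ANSWER_RE.findall(response)
    return 1.0 if has_think and len(answers) == 1 else 0.0


def outcome_reward(response: str, gold_answers: list[str]) -> float:
    """Rule-based check: 1.0 if the last answer block exactly matches any gold answer."""
    answers = ANSWER_RE.findall(response)
    if not answers:
        return 0.0
    prediction = normalize(answers[-1])
    return 1.0 if any(prediction == normalize(g) for g in gold_answers) else 0.0


def total_reward(response: str, gold_answers: list[str], format_weight: float = 0.1) -> float:
    """Combine both signals; the additive form and the weight are assumptions."""
    return outcome_reward(response, gold_answers) + format_weight * format_reward(response)
```

A response such as `<think>...</think><answer>Paris</answer>` scored against the gold answer `Paris` would receive both the outcome and the format component under these assumptions.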
Findings
LLMs show strong scaling in search tasks with high pass@k scores
SSRL reduces dependence on external search engines
Models trained with SSRL facilitate robust sim-to-real transfer
Abstract
We investigate the potential of large language models (LLMs) to serve as efficient simulators for agentic search tasks in reinforcement learning (RL), thereby reducing dependence on costly interactions with external search engines. To this end, we first quantify the intrinsic search capability of LLMs via structured prompting and repeated sampling, which we term Self-Search. Our results reveal that LLMs exhibit strong scaling behavior with respect to the inference budget, achieving high pass@k on question-answering benchmarks, including the challenging BrowseComp task. Building on these observations, we introduce Self-Search RL (SSRL), which enhances LLMs' Self-Search capability through format-based and rule-based rewards. SSRL enables models to iteratively refine their knowledge utilization internally, without requiring access to external tools. Empirical evaluations demonstrate that…
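For readers unfamiliar with pass@k under repeated sampling, here is a minimal sketch of the commonly used unbiased estimator: given n sampled responses to a question, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper computes pass@k with this exact estimator is an assumption, and the function names are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c of n sampled responses were correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark; each item is (n_samples, n_correct)."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Scaling the inference budget corresponds to increasing k (and n): a question answered correctly in only a small fraction of samples still contributes a high pass@k once k is large enough.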
Peer Reviews
Decision·Submitted to ICLR 2026
1. The authors performed extensive experiments on the sample size of search agents and argue that performance can be improved by scaling the sample size even without retrieving external knowledge. Motivated by these observations, the proposed SSRL method eliminates search API costs and improves the performance of search agents by training models to exploit their own internal knowledge. 2. Through extensive experiments, the authors show that models trained with SSRL can both work with i
1. The authors did not propose novel insights or a new learning framework. Instead, the work seems to be an improvement on the Search-R1 baseline, featuring more refined RL training techniques. Therefore, the resulting method seems to be an improved "R1-Base / Instruct" baseline without searching. 2. Although the experimental results show that LLM performance can match search agents with SSRL training, the model does not access external information or facts that may not exist within the embedded
1. The paper is very well written and easy to understand. 2. The inference time scaling experiments are interesting and insightful. 3. The study of “self-search” reinforcement learning is novel, and it is great to see that the model trained with “self-search” RL can generalize to adopting tools during inference. 4. The authors conduct extensive experiments to demonstrate the effectiveness of the proposed method.
1. Lack of an ablation study for the format reward. In Section 3.2, the authors propose outcome and format reward functions. However, there is no ablation study to verify the effectiveness of the format reward. 2. Is the method only suitable for a specific type of LLM? The main results in Table 1 are based on Llama models. It is questionable whether SSRL can still outperform other methods on other types of LLMs, such as Qwen. From Figure 5, it seems that the performance on Qwe
- The idea of leveraging the intrinsic search ability of LLMs to reduce reliance on external search engines during RL training is interesting and might be novel in this particular setting. I recognize that it addresses a practical challenge in training LLMs for agentic tasks. - The empirical evaluation is comprehensive, covering multiple LLM families and benchmarks. The results demonstrate the effectiveness of SSRL in improving search accuracy and reducing external search costs. Also, the Append
- My biggest concern is about the clarity of the proposed approach. While the high-level idea of Self-Search and SSRL is understandable, some details are unclear to me. I believe the paper would benefit from more polishing to improve clarity. - Starting from Figure 1 (left), at first glance, it was not clear what the dotted box represented for the full-sim search section. - What are the instructions/prompts used? I'm assuming they are different when studying the Self-Search ability (Section 2
