ZeroSearch: Incentivize the Search Capability of LLMs without Searching
Hao Sun, Zile Qiao, Jiayan Guo, Xuanbo Fan, Yingyan Hou, Yong Jiang, Pengjun Xie, Yan Zhang, Fei Huang, Jingren Zhou

TL;DR
ZeroSearch introduces a reinforcement learning framework that enhances LLMs' search abilities by using simulated searches during training, reducing costs and improving robustness.
Contribution
It presents a novel RL approach that uses a curriculum of progressively degraded simulated document quality to train LLMs to search effectively without real-time search engine access.
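As a rough illustration of what a curriculum over document quality could look like, the sketch below ramps up the probability of serving a degraded (noisy) simulated document as training progresses. The function names, the exponential shape of the ramp, and the default values (`p_start`, `p_end`, `base`) are assumptions made for this example and are not taken from the paper.

```python
import random


def noise_probability(step: int, total_steps: int, p_start: float = 0.0,
                      p_end: float = 0.5, base: float = 4.0) -> float:
    """Hypothetical schedule: chance of serving a degraded document at `step`.

    Assumes base > 1 and total_steps > 0.
    """
    # Clamp training progress to [0, 1].
    frac = min(max(step / total_steps, 0.0), 1.0)
    # Exponential ramp normalized to [0, 1]: slow increase early, faster later.
    ramp = (base ** frac - 1.0) / (base - 1.0)
    return p_start + ramp * (p_end - p_start)


def pick_document_mode(step: int, total_steps: int, rng: random.Random) -> str:
    """Decide whether the simulated search should return a useful or noisy document."""
    if rng.random() < noise_probability(step, total_steps):
        return "noisy"
    return "useful"


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        p = noise_probability(step, 100)
        print(step, round(p, 3), pick_document_mode(step, 100, rng))
```

With a schedule of this kind, early rollouts see mostly useful documents so the policy can learn the basic search-and-answer format, while later rollouts expose it to more low-quality retrievals.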
Findings
A 7B retrieval module matches real search engine performance.
A 14B retrieval module surpasses real search engine performance.
The method generalizes across different model sizes and RL algorithms.
Abstract
Effective information searching is essential for enhancing the reasoning and generation capabilities of large language models (LLMs). Recent research has explored using reinforcement learning (RL) to improve LLMs' search capabilities by interacting with live search engines in real-world environments. While these approaches show promising results, they face two major challenges: (1) Uncontrolled Document Quality: The quality of documents returned by search engines is often unpredictable, introducing noise and instability into the training process. (2) Prohibitively High API Costs: RL training requires frequent rollouts, potentially involving hundreds of thousands of search requests, which incur substantial API expenses and severely constrain scalability. To address these challenges, we introduce ZeroSearch, a novel RL framework that incentivizes the capabilities of LLMs to use a real…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
1) The problem explored in this work is relatively novel. While this may not be the first time that someone trains a model to imitate a system or API, it is new to my knowledge in this domain (e.g., for RL towards RAG). 2) The ideas and presentation are simple and easy to follow. The results are compared against several recent related papers, and it is easy to appreciate from them that the core of this method 'works' on the benchmarks selected.
Weakness 1: The authors offer surprisingly little insight into why or how this method works at all. In the general case, it is simply not reasonable to expect that one can imitate an arbitrary search API with imitation learning. This is because there is, in principle, irreducible information in the search engine's knowledge of the world or of certain domains that is simply not available in the model's pre-training data. This argument does not imply that the method proposed in this paper will fail
1. The proposed ZeroSearch trains the model using a simulated search engine, which removes the high costs of making frequent search calls during RL training. 2. Despite not using a real search engine, it solves the problem by using a simulation LLM and controlling the quality of generated documents with a curriculum learning-based rollout mechanism. 3. The proposed method is effective on single and multi-hop QA tasks, with experiments showing that simulator search engines can match the performance
1. The training of ZeroSearch still requires extensive engine calls, which are used to curate the training data for the search simulation LLM. Considering that Search-R1 can converge within a few hundred steps, it is questionable whether ZeroSearch really saves on search costs. 2. While ZeroSearch can reduce the API costs of training the policy model, the method introduces a new, significant computational cost in training and deploying a separate LLM as the simulation server, and the costs of which should
- The paper is well written, and the proposed approach is clearly presented. - The main idea of the framework is interesting and well motivated. Since the goal of RL training here is to teach the model how to use a search tool, simulating search with a fixed, noisy model that provides outdated or imperfect information effectively avoids the cost of real API calls while improving robustness to inconsistent or low-quality retrievals. - The experimental section is extensive and promising, covering
- The paper would benefit from quantitative results on the robustness of the reward choice. The authors mention that exact matching was prone to reward hacking, but this claim is only stated qualitatively. Including some quantitative results, or an additional ablation study comparing reward formulations, would make the argument stronger and more convincing.
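To make the kind of comparison this reviewer asks for concrete, the sketch below implements two common answer-level reward formulations, exact match and token-level F1, using SQuAD-style answer normalization. It is only an illustration of what such an ablation might compare: the function names are hypothetical, and nothing here is claimed to be the reward the authors actually use.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation and English articles, then split into tokens."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall between prediction and gold."""
    p, g = normalize(pred), normalize(gold)
    if not p or not g:
        return float(p == g)
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    for pred in ("Paris", "Paris France", "London"):
        print(pred, exact_match(pred, "Paris"), round(token_f1(pred, "Paris"), 3))
```

Exact match is all-or-nothing, whereas token F1 gives partial credit and penalizes extra tokens through the precision term, so the two rewards can shape answer length quite differently during RL.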
* Addresses an important practical bottleneck in RL-based search training — high API costs and unstable document quality — with a controlled simulation-based setup. * The curriculum degradation and document quality control mechanisms are conceptually reasonable and could improve stability in RL training. * The writing and experimental organization are clear and the comparisons cover multiple model families (Qwen, LLaMA).
1. **Metric validity (critical)** The paper exclusively reports Exact Match (EM) as its main evaluation metric. However, EM doesn't capture search quality, **especially for prompting-based methods** (an LLM fine-tuned w/ EM could outperform prompting-based methods w/ search), so the performance table is not trustworthy. 2. **Unfair baseline setup** For non-multi-turn baselines like RAG, the comparison is imbalanced. If ZeroSearch effectively accesses 4 turns × 3 documents = 12 documents (at ma
Taxonomy
Topics: Topic Modeling · Information Retrieval and Search Behavior · Natural Language Processing Techniques
Methods: Balanced Selection
