TL;DR
This paper introduces SEM, a reinforcement learning framework that trains large language models to efficiently decide when to search externally, reducing redundant searches while maintaining high answer accuracy.
Contribution
The paper presents a novel post-training reinforcement learning approach, SEM, that explicitly optimizes search behavior in LLMs using a balanced dataset and structured reasoning templates.
Findings
Reduces redundant search operations significantly.
Maintains or improves answer accuracy on benchmarks.
Enhances reasoning efficiency in large language models.
Abstract
Recent advancements in Large Language Models (LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools, particularly search engines. However, teaching models to discern when to invoke search and when to rely on their internal knowledge remains a significant challenge. Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiency and unnecessary cost. In this paper, we propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage. By constructing a balanced dataset combining MuSiQue and MMLU, we create scenarios where the model must learn to distinguish between questions it can answer directly and those requiring external retrieval. We design a structured reasoning template and employ Group Relative Policy Optimization (GRPO) to post-train the…
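To make the idea in the abstract concrete, here is a minimal sketch of a search-aware reward combined with the group-relative advantage normalization that GRPO is known for. It is not the paper's reward formulation: the `Rollout` fields, the `search_needed` label, the `penalty` value of 0.5, and the use of the population standard deviation are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled response to a question (all fields are hypothetical)."""
    correct: bool        # final answer matches the reference
    num_searches: int    # number of search calls the model issued
    search_needed: bool  # assumed label: question requires external retrieval


def search_aware_reward(r: Rollout, penalty: float = 0.5) -> float:
    """Reward correct answers, and discount those that searched needlessly.

    The 0/1 base reward and the penalty size are illustrative assumptions.
    """
    if not r.correct:
        return 0.0
    redundant = r.num_searches > 0 and not r.search_needed
    return 1.0 - penalty if redundant else 1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward by the mean and std of its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(x - mu) / (sigma + eps) for x in rewards]


# Four sampled responses to one question the model could answer directly.
group = [
    Rollout(correct=True, num_searches=0, search_needed=False),
    Rollout(correct=True, num_searches=2, search_needed=False),
    Rollout(correct=False, num_searches=1, search_needed=False),
    Rollout(correct=False, num_searches=0, search_needed=False),
]
rewards = [search_aware_reward(r) for r in group]
advantages = group_relative_advantages(rewards)
```

Under these assumptions, the response that answers correctly without searching receives the largest advantage in its group, so the policy update favors it over the equally correct response that searched twice.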
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The proposed SEM significantly reduces redundant search operations by training the model to exploit its internal knowledge on simple queries while performing search when external knowledge is required. 2. The paper introduces a reward function that explicitly penalizes unnecessary searches and rewards effective retrieval, successfully teaching search LLMs to distinguish queries they can answer from those they cannot, and to search only for the complex ones.
1. This paper essentially combines reasoning and QA datasets to train a search agent that prioritizes answer correctness followed by the number of searches, with no novel techniques or settings introduced. 2. Lack of datasets & baselines. Many baselines like IRCoT and Search-R1 are not adopted in this paper, and further reasoning / QA datasets like MATH & 2Wiki should also be considered to improve the evaluation. 3. Limited training, analysis & observations to demonstrate the effica
- The paper studies an important problem: if solved successfully, it can decrease the cost of training and inference. - The reward model designed in this paper is reasonable.
This paper has a very limited contribution and also doesn’t perform well compared to other studies. For example, [1] achieves much better results on the same datasets without including this adaptive scoring in its reward model. Additionally, many RAG baselines are not provided; including them would make the comparison fair. These baselines can be found in [1]. Another baseline that studies adaptive retrieval is [2], which should be included. In general, the introduced methods work
* **Well-Motivated Problem**. The dynamic reasoning/tool-calling issue that this paper aims to solve is a good, practical question. * **Good performance**. The method shows consistent improvements across multiple benchmarks.
1. **Limited Technical Novelty**. The paper fails to discuss or compare with established methods that address similar adaptive retrieval problems. Notable omissions include Self-RAG, Adaptive-RAG, and subsequent works that have proposed various solutions for determining when retrieval is necessary. The absence of these discussions makes it difficult to assess the novelty and relative merits of the proposed approach. 2. **Incomplete Experimental Analysis**. For agentic search agents, recent stron
1. The topic is interesting.
1. Poor Presentation Quality. The presentation quality is below the expected standard, giving the impression that the paper was hastily prepared rather than carefully polished. For instance, in Section 2.2 (Reward Formulation), the equation unnecessarily occupies almost half a page, which significantly reduces readability. Similarly, in Section 2.3 (Training Template), the format is presented in raw text, which looks unprofessional and visually unappealing. A structured table or figure would conv
