SEM: Reinforcement Learning for Search-Efficient Large Language Models

Zeyang Sha; Shiwen Cui; Weiqiang Wang

arXiv:2505.07903·cs.CL·May 14, 2025

4 Reviews

TL;DR

This paper introduces SEM, a reinforcement learning framework that trains large language models to efficiently decide when to search externally, reducing redundant searches while maintaining high answer accuracy.

Contribution

The paper presents a novel post-training reinforcement learning approach, SEM, that explicitly optimizes search behavior in LLMs using a balanced dataset and structured reasoning templates.

Findings

01. Reduces redundant search operations significantly.

02. Maintains or improves answer accuracy on benchmarks.

03. Enhances reasoning efficiency in large language models.

Abstract

Recent advancements in Large Language Models (LLMs) have demonstrated their capabilities not only in reasoning but also in invoking external tools, particularly search engines. However, teaching models to discern when to invoke search and when to rely on their internal knowledge remains a significant challenge. Existing reinforcement learning approaches often lead to redundant search behaviors, resulting in inefficiencies and over-cost. In this paper, we propose SEM, a novel post-training reinforcement learning framework that explicitly trains LLMs to optimize search usage. By constructing a balanced dataset combining MuSiQue and MMLU, we create scenarios where the model must learn to distinguish between questions it can answer directly and those requiring external retrieval. We design a structured reasoning template and employ Group Relative Policy Optimization (GRPO) to post-train the…
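The combination of a search-aware reward and GRPO described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the `answerable_without_search` flag, and the reward values 1.0 / 0.5 / 0.0 are hypothetical and are not the paper's reward formula. The only standard ingredient is GRPO's group-relative advantage, which normalizes each sampled response's reward by the mean and standard deviation of its group; using the population standard deviation and an `eps` term is an implementation choice.

```python
from statistics import mean, pstdev


def search_aware_reward(answer_correct: bool,
                        used_search: bool,
                        answerable_without_search: bool) -> float:
    """Hypothetical reward, NOT the paper's formula.

    Full credit for a correct answer, partial credit when the answer is
    correct but the search was redundant, and no credit for a wrong answer.
    The numeric values are assumed for illustration only.
    """
    if not answer_correct:
        return 0.0
    if used_search and answerable_without_search:
        return 0.5  # correct, but the search call was unnecessary
    return 1.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an implementation choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of four sampled responses to a question the model already knows.
group = [
    search_aware_reward(True, False, True),   # correct, no search
    search_aware_reward(True, True, True),    # correct, redundant search
    search_aware_reward(False, True, True),   # wrong
    search_aware_reward(True, False, True),   # correct, no search
]
advantages = group_relative_advantages(group)
```

Under these assumed values, responses that answer correctly without a redundant search receive the highest advantage in the group, so the policy update pushes the model toward skipping searches it does not need.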

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

1. The proposed SEM significantly reduces redundant search operations by training the model to exploit its internal knowledge on simple queries while performing search when external knowledge is required.
2. The paper introduces a reward function that explicitly penalizes unnecessary searches and rewards effective retrieval, helping search LLMs distinguish queries they can answer from those they cannot, and search only for the complex ones.

Weaknesses

1. This paper essentially combines reasoning and QA datasets to train a search agent that prioritizes answer correctness, followed by the number of searches, with no novel techniques or settings introduced.
2. Lack of datasets & baselines. Many baselines such as IRCoT and Search-R1 are not adopted in this paper, and further reasoning / QA datasets such as MATH & 2Wiki should also be considered to improve the evaluation.
3. Limited training, analysis & observations to demonstrate the effica

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- The paper studies an important problem: if solved successfully, it can decrease the cost of training and inference.
- The reward model designed in this paper is reasonable.

Weaknesses

This paper makes a very limited contribution and does not perform well compared to other studies. For example, [1] achieves much better results on the same datasets without including this adaptive scoring in its reward model. Additionally, many RAG baselines are not provided; including them would make the comparison fair. These baselines can be found in [1]. Another baseline that studies adaptive retrieval, [2], should also be included. In general, the introduced methods work

Reviewer 03 · Rating 2 · Confidence 5

Strengths

* **Well-Motivated Problem**. The dynamic reasoning/tool-calling issue that this paper aims to solve is a good, practical question.
* **Good performance**. The method shows consistent improvements across multiple benchmarks.

Weaknesses

1. **Limited Technical Novelty**. The paper fails to discuss or compare with established methods that address similar adaptive retrieval problems. Notable omissions include Self-RAG, Adaptive-RAG, and subsequent works that have proposed various solutions for determining when retrieval is necessary. The absence of these discussions makes it difficult to assess the novelty and relative merits of the proposed approach.
2. **Incomplete Experimental Analysis**. For agentic search agents, recent stron

Reviewer 04 · Rating 0 · Confidence 5

Strengths

1. The topic is interesting.

Weaknesses

1. Poor Presentation Quality. The presentation quality is below the expected standard, giving the impression that the paper was hastily prepared rather than carefully polished. For instance, in Section 2.2 (Reward Formulation), the equation occupies almost half a page, which significantly reduces readability. Similarly, in Section 2.3 (Training Template), the format is presented as raw text, which looks unprofessional. A structured table or figure would conv
