Agentic Reinforcement Learning for Search is Unsafe

Yushi Yang; Shreyansh Padarha; Andrew Lee; Adam Mahdi

arXiv:2510.17431·cs.CL·October 21, 2025

Open Access 3 Reviews

TL;DR

This paper reveals that agentic reinforcement learning models for search are vulnerable to simple attacks that induce harmful searches, exposing safety weaknesses in current training methods and emphasizing the need for safety-aware RL pipelines.

Contribution

The paper identifies specific vulnerabilities in RL-trained search models, demonstrating that simple attacks can sharply lower refusal rates and the safety of both answers and search queries, and highlighting the need for safety-aware training.

Findings

01

Attacks reduce refusal rates by up to 60%.

02

Answer safety decreases by up to 82.5%.

03

Search-query safety drops by up to 82.4%.

Abstract

Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to…
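The three quantities reported in the abstract (refusal rate, answer safety, search-query safety) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `JudgedSample` record, the `metric_rates` and `relative_drop` helpers, and the toy labels are hypothetical and are not taken from the paper or its code. In particular, the sketch treats a "drop" as a relative change against the no-attack baseline; the abstract does not say whether the paper reports relative or absolute differences.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class JudgedSample:
    """Hypothetical per-prompt labels, e.g. produced by a safety judge."""
    refused: bool        # the model refused the harmful request
    answer_safe: bool    # the final answer was judged safe
    queries_safe: bool   # all emitted search queries were judged safe


def metric_rates(samples: List[JudgedSample]) -> Dict[str, float]:
    """Return the fraction of samples that satisfy each safety criterion."""
    if not samples:
        raise ValueError("need at least one judged sample")
    n = len(samples)
    return {
        "refusal": sum(s.refused for s in samples) / n,
        "answer_safety": sum(s.answer_safe for s in samples) / n,
        "search_safety": sum(s.queries_safe for s in samples) / n,
    }


def relative_drop(baseline: float, attacked: float) -> float:
    """Percentage decrease of a rate relative to its baseline value."""
    if baseline <= 0:
        raise ValueError("baseline rate must be positive")
    return 100.0 * (baseline - attacked) / baseline


if __name__ == "__main__":
    # Toy labels only; these are NOT results from the paper.
    baseline = [JudgedSample(True, True, True)] * 4
    attacked = [
        JudgedSample(True, True, True),
        JudgedSample(False, False, False),
        JudgedSample(False, False, True),
        JudgedSample(False, True, True),
    ]
    base_rates = metric_rates(baseline)
    atk_rates = metric_rates(attacked)
    for name in base_rates:
        drop = relative_drop(base_rates[name], atk_rates[name])
        print(f"{name}: {base_rates[name]:.2f} -> {atk_rates[name]:.2f} "
              f"({drop:.1f}% relative drop)")
```

Keeping the three rates separate, rather than collapsing them into a single score, mirrors the point made in the abstract: a model can still produce a safe-looking answer while having issued unsafe search queries along the way.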

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 2

Strengths

The paper demonstrates a compelling analysis of agentic LLM safety vulnerabilities, supported by several key strengths. Its claims about RL training prioritizing task success over safety are empirically validated through controlled experiments comparing instruction-tuned and RL-trained agents. The experimental part is good, featuring systematic ablation studies that isolate the impact of different attack vectors. Controls for prompt length and complexity, combined with cross-model validation,

Weaknesses

First, while it identifies RL-induced behaviors as a core issue, the *explanation* for how RL overrides safety tuning is insufficient. The mechanisms behind this conflict need further conceptual clarification, ideally through diagrams or more detailed theoretical discussion, to fully explain this key finding. Second, the paper's framing of its contribution could be strengthened by a more thorough engagement with related prior work. Although it cites foundational agent research like WebGPT/RAG

Reviewer 02 · Rating 8 · Confidence 2

Strengths

S1. The paper identifies a relevant and timely problem, and demonstrates its existence empirically.
S2. Strong empirical evaluation on state-of-the-art agents.
S3. The paper is clearly written and easy to follow.
S4. The paper provides enough details and code as supplementary material for it to be reproducible.

Weaknesses

W1. While using an LLM to judge safety is a standard practice nowadays, it is a limitation and begs the question of "unsafe according to whom?".
W2. The paper is primarily diagnostic: it identifies a safety weakness but does not experimentally test mitigation strategies. While the discussion section outlines possible remedies, they remain unvalidated.

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- Safety of RL-trained tool-use agents (especially search) is under-examined and relevant for deployment.
- The three-metric evaluation (refusal, answer safety, search safety) provides nuanced assessment. In addition, findings hold across two model families, both local and web search, demonstrating some generalizability.
- Prefill/prompt tweaks are realistic (user-accessible) and isolate a when-to-search failure mode.

Weaknesses

- Novelty & Impact: The core finding feels incremental and closely aligned with well-known RAG/jailbreak dynamics: if you can steer retrieval early, you can bias generation toward unsafe outcomes.
- Usefulness of the Study: The attack surface here (forcing <search> / multi-search) is quite simple, and the experiments use relatively weak, non-SOTA models. As a result, it's unclear whether the phenomenon meaningfully persists, and to what degree, in production-grade systems (e.g., Claude,

Taxonomy

Topics: Topic Modeling · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning