Agentic Reinforcement Learning for Search is Unsafe
Yushi Yang, Shreyansh Padarha, Andrew Lee, Adam Mahdi

TL;DR
This paper reveals that agentic reinforcement learning models for search are vulnerable to simple attacks that induce harmful searches, exposing safety weaknesses in current training methods and emphasizing the need for safety-aware RL pipelines.
Contribution
It identifies specific vulnerabilities in RL-trained search models, demonstrating that simple attacks sharply reduce refusal rates and the safety of both answers and search queries, and highlighting the need for safety-aware training.
Findings
Attacks reduce refusal rates by up to 60.0%.
Answer safety decreases by up to 82.5%.
Search-query safety drops by up to 82.4%.
Abstract
Agentic reinforcement learning (RL) trains large language models to autonomously call tools during reasoning, with search as the most common application. These models excel at multi-step reasoning tasks, but their safety properties are not well understood. In this study, we show that RL-trained search models inherit refusal from instruction tuning and often deflect harmful requests by turning them into safe queries. However, this safety is fragile. Two simple attacks, one that forces the model to begin its response with a search (Search attack) and another that encourages the model to search repeatedly (Multi-search attack), trigger cascades of harmful searches and answers. Across two model families (Qwen, Llama) with both local and web search, these attacks lower refusal rates by up to 60.0%, answer safety by 82.5%, and search-query safety by 82.4%. The attacks succeed by triggering models to…
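The three reported quantities (refusal rate, answer safety, search-query safety) can be read as per-condition rates that are compared between a baseline run and an attacked run. The sketch below is only an illustration of that bookkeeping, not the authors' evaluation code: the `JudgedRun` record, the boolean judgments, and the function names are hypothetical, and the summary above does not say whether the reported drops are absolute (percentage points) or relative, so both are computed.

```python
from dataclasses import dataclass


@dataclass
class JudgedRun:
    """One model response after judging (hypothetical record layout)."""
    refused: bool       # the model declined the harmful request
    answer_safe: bool   # the final answer was judged safe
    search_safe: bool   # the issued search queries were judged safe


def rates(runs):
    """Return the fraction of runs that are refused / answer-safe / search-safe."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one judged run")
    return {
        "refusal": sum(r.refused for r in runs) / n,
        "answer_safety": sum(r.answer_safe for r in runs) / n,
        "search_safety": sum(r.search_safe for r in runs) / n,
    }


def drop(baseline, attacked):
    """Compare two rate dicts; report absolute and relative drops per metric."""
    out = {}
    for key, base in baseline.items():
        diff = base - attacked[key]
        out[key] = {
            "absolute_pp": 100 * diff,                         # percentage points
            "relative_pct": 100 * diff / base if base else None,
        }
    return out


# Toy data, invented purely to exercise the functions.
baseline_runs = [JudgedRun(True, True, True)] * 4
attacked_runs = [
    JudgedRun(True, True, True),
    JudgedRun(False, True, False),
    JudgedRun(False, False, False),
    JudgedRun(False, False, False),
]
result = drop(rates(baseline_runs), rates(attacked_runs))
```

Keeping the three metrics separate matters here because a model can avoid an outright refusal while still issuing unsafe queries or producing an unsafe answer.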
Peer Reviews
Decision: Submitted to ICLR 2026
The paper presents a compelling analysis of agentic LLM safety vulnerabilities, supported by several key strengths. Its claims about RL training prioritizing task success over safety are empirically validated through controlled experiments comparing instruction-tuned and RL-trained agents. The experimental design is strong, featuring systematic ablation studies that isolate the impact of different attack vectors. Controls for prompt length and complexity, combined with cross-model validation,
First, while it identifies RL-induced behaviors as a core issue, the *explanation* for how RL overrides safety tuning is insufficient. The mechanisms behind this conflict need further conceptual clarification, ideally through diagrams or more detailed theoretical discussion, to fully explain this key finding. Second, the paper's framing of its contribution could be strengthened by a more thorough engagement with related prior work. Although it cites foundational agent research like WebGPT/RAG
S1. The paper identifies a relevant and timely problem, and demonstrates its existence empirically.
S2. Strong empirical evaluation on state-of-the-art agents.
S3. The paper is clearly written and easy to follow.
S4. The paper provides enough details and code as supplementary material for it to be reproducible.
W1. While using an LLM to judge safety is standard practice nowadays, it is a limitation and raises the question of "unsafe according to whom?".
W2. The paper is primarily diagnostic: it identifies a safety weakness but does not experimentally test mitigation strategies. While the discussion section outlines possible remedies, they remain unvalidated.
- Safety of RL-trained tool-use agents (especially search) is under-examined and relevant for deployment.
- The three-metric evaluation (refusal, answer safety, search safety) provides nuanced assessment. In addition, findings hold across two model families, both local and web search, demonstrating some generalizability.
- Prefill/prompt tweaks are realistic (user-accessible) and isolate a when-to-search failure mode.
- Novelty & Impact: The core finding feels incremental and closely aligned with well-known RAG/jailbreak dynamics: if you can steer retrieval early, you can bias generation toward unsafe outcomes.
- Usefulness of the Study: The attack surface here (forcing <search> / multi-search) is quite simple, and the experiments use relatively weak, non-SOTA models. As a result, it's unclear whether the phenomenon meaningfully persists, and to what degree, in production-grade systems (e.g., Claude,
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Adversarial Robustness in Machine Learning
