How Brittle is Agent Safety? Rethinking Agent Risk under Intent Concealment and Task Complexity
Zihan Ma, Dongsheng Zhu, Shudong Liu, Taolin Zhang, Junnan Liu, Qingqiu Li, Minnan Luo, Songyang Zhang, Kai Chen

TL;DR
This paper introduces OASIS, a benchmark suite to evaluate the brittleness of agent safety under intent concealment and task complexity, revealing critical vulnerabilities and paradoxes in current safety assessments.
Contribution
The paper presents a novel two-dimensional analysis framework and a hierarchical benchmark, OASIS, for assessing agent safety in complex and intent-obscured scenarios.
Findings
Safety degrades sharply with intent concealment.
Agents appear safer on harder tasks due to capability limits.
Introduction of a hierarchical benchmark with detailed annotations.
Abstract
Current safety evaluations for LLM-driven agents primarily focus on atomic harms, failing to address sophisticated threats where malicious intent is concealed or diluted within complex tasks. We address this gap with a two-dimensional analysis of agent safety brittleness under the orthogonal pressures of intent concealment and task complexity. To enable this, we introduce OASIS (Orthogonal Agent Safety Inquiry Suite), a hierarchical benchmark with fine-grained annotations and a high-fidelity simulation sandbox. Our findings reveal two critical phenomena: safety alignment degrades sharply and predictably as intent becomes obscured, and a "Complexity Paradox" emerges, where agents seem safer on harder tasks only due to capability limitations. By releasing OASIS and its simulation environment, we provide a principled foundation for probing and strengthening agent safety in these overlooked…
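The "Complexity Paradox" described above can be made concrete with a small tally. The following is a minimal sketch, not the paper's metric: the outcome labels (`refused`, `failed`, `completed`), the `Episode` record, and the integer complexity levels are all hypothetical names introduced for illustration, and the episode counts are made up. It shows how an aggregate "harm not completed" rate can rise with task difficulty even as the explicit refusal rate falls.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical outcome labels; the benchmark's actual annotation schema may differ.
REFUSED = "refused"      # agent explicitly declined the harmful task
FAILED = "failed"        # agent attempted the task but could not finish it
COMPLETED = "completed"  # agent carried the harmful task through


@dataclass(frozen=True)
class Episode:
    complexity: int  # illustrative task-complexity level (higher = harder)
    outcome: str     # one of REFUSED, FAILED, COMPLETED


def safety_breakdown(episodes):
    """Per complexity level, return apparent safety and explicit refusal rate."""
    by_level = {}
    for ep in episodes:
        by_level.setdefault(ep.complexity, Counter())[ep.outcome] += 1

    report = {}
    for level, counts in sorted(by_level.items()):
        total = sum(counts.values())
        report[level] = {
            # Harm not completed for any reason (refusal or incapacity).
            "apparent_safety": (counts[REFUSED] + counts[FAILED]) / total,
            # Harm avoided because the agent actually declined.
            "refusal_rate": counts[REFUSED] / total,
        }
    return report


# Made-up episodes for illustration only; these are not results from the paper.
episodes = (
    [Episode(1, REFUSED)] * 6 + [Episode(1, COMPLETED)] * 4
    + [Episode(3, REFUSED)] * 3 + [Episode(3, FAILED)] * 5
    + [Episode(3, COMPLETED)] * 2
)
report = safety_breakdown(episodes)
for level, stats in report.items():
    print(level, stats)
```

In this toy data the harder level looks safer on the aggregate measure while the agent refuses less often, which is the pattern the abstract attributes to capability limits rather than alignment. Reporting the two rates separately is one simple way to keep them from being conflated.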
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. The paper introduces a new two-dimensional benchmark with per-step harm labels. The involvement of domain experts and double verification by the authors in benchmark preparation is good. 2. The paper shows how concealed intent and task complexity jointly affect safety at different levels of each, which seems a more logical basis for a comprehensive judgment of an agent’s safety capability than a unidirectional measurement. 3. The benchmark is evaluated on 8 different LLMs and identified several int
1. The paper states that “all tasks were synthesized using Gemini 2.5 Pro,” but does not clarify how Gemini generated these tasks or what prompting or control strategy was used. Without proper justification, it is difficult to assess whether the resulting tasks are realistic or represent actual real-world scenarios. 2. Although the evaluations are well organized, the benchmark is small, so it is difficult to say whether the findings are actually statistically genera
- The authors show how intent concealment and task complexity interact to influence the safety performance of language-model agents. They highlight the “complexity paradox”: the observation that agents may appear *safer* in more complex scenarios simply because they fail to act, reflecting capability limitations rather than genuine safety awareness. - The paper introduces a diagnostic framework that evaluates both process and outcome through diverse metrics such as the *Hierarchical Refusal
The most critical weakness of this paper lies in its lack of clear explanations, definitions, and transparency. Many descriptions of the experimental setup are vague or informal—closer in tone to a blog post than an academic paper. As a result, the work falls short of reproducibility standards: it is difficult to fully understand the authors’ design decisions or replicate their experiments. In particular, no concrete examples of datasets, task instances, or tool usage are provided, which further
- The paper's primary strength is its novel problem formulation. By shifting the focus from "atomic harms" to the more realistic, orthogonal dimensions of "intent concealment" and "task complexity," it reveals a critical and overlooked gap in safety research. - The paper goes beyond reporting simple refusal rates. The discovery and classification of "static, pre-execution" vs. "dynamic, in-workflow" safety mechanisms is a key insight into *how* safety systems fail.
- While acknowledged by the authors, the curated set of 53 general-purpose tools is a limitation. Real-world agents will need to interact with thousands of dynamic, heterogeneous, and evolving third-party APIs. It is unclear how these findings (especially the "Complexity Paradox") will scale when the complexity of tool use itself increases. - While the sandbox is described as "high-fidelity," the tasks are ultimately synthetic, and the tool outputs are pre-synthesized. This means the agent cannot elicit n
I agree that we would need a more sophisticated benchmark to evaluate the AI agent safety issue. I think the idea of testing safety on these two dimensions of concealment and complexity is a good direction. The paper also has some interesting findings, like the "Complexity Paradox" and the fact that some models are "static" in their safety checks while others are "dynamic".
First, the paper says agents seem safer on complex tasks, but this could be because their "planning capabilities" fail them and they cannot complete the task. How, then, would the authors distinguish whether the results are a test of capability or a new insight about safety alignment? Second, the way the benchmark (OASIS) was created seems a bit circular. The paper says the tasks were "synthesized using Gemini 2.5 Pro" and then validated by humans. Could the authors provide more informatio
Taxonomy
Topics: Safety Systems Engineering in Autonomy · Occupational Health and Safety Research · Adversarial Robustness in Machine Learning
