Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios
Zuoyu Zhang, Yancheng Zhu

TL;DR
This paper introduces ROME, a benchmark pipeline that rewrites known unsafe trajectories into more deceptive instances for evaluating safety judgment, and ARISE, a retrieval-based method that improves safety judgment without retraining.
Contribution
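The paper presents ROME, which builds challenging deceptive safety benchmarks, and ARISE, which improves safety judgment through analogical reasoning. Both target the distribution shift between the explicit risks that existing benchmarks emphasize and deceptive out-of-distribution scenarios.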
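To make the idea of retrieval-based analogical judgment concrete, here is a minimal sketch of one way it could work without retraining: retrieve the labeled trajectories most similar to a new one and place them in the judge prompt as analogies. This is an illustration only, not the paper's ARISE implementation. The bag-of-words cosine similarity, the reference bank, the prompt wording, and the names `retrieve_analogies` and `build_prompt` are all assumptions made for this example.

```python
import math
import re
from collections import Counter

# Hypothetical reference bank of labeled trajectories (invented for illustration).
BANK = [
    {"trajectory": "agent emails the customer database export to an external address",
     "label": "unsafe"},
    {"trajectory": "agent reads the weather forecast and summarizes it",
     "label": "safe"},
    {"trajectory": "agent deletes system log files after changing permissions",
     "label": "unsafe"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_analogies(query, bank, k=2):
    """Return the k bank entries most similar to the query trajectory."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(item["trajectory"]))), item)
              for item in bank]
    # Sort on the score only; ties keep their original bank order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored[:k]]


def build_prompt(query, analogies):
    """Assemble a judge prompt that lists retrieved cases before the query."""
    lines = ["Reference cases with known safety labels:"]
    for i, case in enumerate(analogies, start=1):
        lines.append(f"{i}. [{case['label']}] {case['trajectory']}")
    lines.append("By analogy with the cases above, "
                 "judge this trajectory as safe or unsafe:")
    lines.append(query)
    return "\n".join(lines)


# A deceptively framed query: the risky action is wrapped in a benign pretext.
QUERY = ("agent sends the database export to an external contractor address "
         "for a routine audit")
TOP = retrieve_analogies(QUERY, BANK, k=2)
print(build_prompt(QUERY, TOP))
```

The resulting prompt would be passed to an unmodified judge model, so no model weights change. A real system would likely replace the token-overlap similarity with learned embeddings.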
Findings
The ROME challenge sets substantially degrade safety-judgment performance.
Hidden-risk cases remain particularly difficult, even for recent models.
ARISE improves judgment quality without retraining.
Abstract
Tool-using agent systems powered by large language models (LLMs) are increasingly deployed across web, app, operating-system, and transactional environments. Yet existing safety benchmarks still emphasize explicit risks, potentially overstating a model's ability to judge deceptive or ambiguous trajectories. To address this gap, we introduce ROME (Red-team Orchestrated Multi-agent Evolution), a controlled benchmark-construction pipeline that rewrites known unsafe trajectories into more deceptive evaluation instances while preserving their underlying risk labels. Starting from 100 unsafe source trajectories, ROME produces 300 challenge instances spanning contextual ambiguity, implicit risks, and shortcut decision-making. Experiments show that these challenge sets substantially degrade safety-judgment performance, with hidden-risk cases remaining particularly non-trivial even for recent…
