OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety
Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap

TL;DR
OpenAgentSafety is a modular framework for evaluating real-world AI agent safety across diverse tools and tasks, revealing significant safety vulnerabilities in current models through comprehensive analysis.
Contribution
It introduces a flexible, real-tool interaction evaluation framework with multi-faceted safety assessment combining rule-based and LLM judgments.
Findings
Over 50% of safety-vulnerable tasks show unsafe behavior in tested models.
Framework supports over 350 multi-turn, multi-user tasks.
Reveals critical safety vulnerabilities in prominent LLMs.
Abstract
Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to…
Peer Reviews
Decision·ICLR 2026 Poster
* The paper tackles a timely problem—evaluating the real-world safety of LLM-based agents—by integrating realistic environments, multiple user intents, and social interactions. * Experiments are conducted across diverse models and risk categories, using four questions (RQ1–RQ4) to guide the analysis. The results are comprehensive.
* OpenAgentSafety offers a well-engineered and comprehensive benchmark, but its novelty is limited. The framework mainly integrates existing components such as OpenHands and Sotopia, extending prior safety benchmarks rather than introducing new methodologies. Its hybrid evaluation and GPT-based task generation are incremental refinements of known techniques. * The paper does not describe a mechanism for filtering or validating generated tasks (e.g., to remove redundant, ill-posed, or low-quality
- Multi-user setting seems like an interesting setup in principle; however, I have critical concerns about NPC reproducibility, and the way the NPCs initiate chats - I like the three settings, combining user intent and NPC intent - Interesting discussions
- There is no info on the NPCs in the main paper nor the appendix: could you at least put some of the prompts? - Are NPC settings reproducible? Fixed open-weight model? What is the model? - When do NPC interact? Is it the agent or the NPC that initiate messages? Because if it is the agent, then if it just decides not to use the chat tool, it will not see malicious messages right? - For some of the tasks the evaluation of the outcome seems a bit arbitrary. My biggest concern is whether in all
## Originality The idea of benchmarking agent safety is not novel, however the specific execution of this benchmark is. In particular, I am not aware of another work that has points 1 and 2 from the summary, these being (1) as realistic environments and (2) different kinds of user / NPC intent. ## Quality and Clarity Overall the quality of the paper is high. The benchmark appears to be well thought out and design decisions justified in the writing. The results are good, and the research que
There are some weaknesses in the paper: 1. The benchmark is excellent, and accordingly I think the paper could be improved by evaluating more models on the benchmark. For example, by evaluating a number of OpenAI, Anthropic, and open source models, and analyzing this data, the reader would have a better idea of the "general" state of agent safety in the world today. 2. The paper would be improved by providing more information to the reader about what types of problems are in the dataset. E.g. p
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)
