OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Sanidhya Vijayvargiya; Aditya Bharat Soni; Xuhui Zhou; Zora Zhiruo Wang; Nouha Dziri; Graham Neubig; Maarten Sap

arXiv:2507.06134·cs.AI·February 18, 2026

OpenAgentSafety: A Comprehensive Framework for Evaluating Real-World AI Agent Safety

Sanidhya Vijayvargiya, Aditya Bharat Soni, Xuhui Zhou, Zora Zhiruo Wang, Nouha Dziri, Graham Neubig, Maarten Sap

PDF

Open Access 3 Reviews

TL;DR

OpenAgentSafety is a modular framework for evaluating real-world AI agent safety across diverse tools and tasks, revealing significant safety vulnerabilities in current models through comprehensive analysis.

Contribution

It introduces a flexible, real-tool interaction evaluation framework with multi-faceted safety assessment combining rule-based and LLM judgments.

Findings

01

Over 50% of safety-vulnerable tasks show unsafe behavior in tested models.

02

Framework supports over 350 multi-turn, multi-user tasks.

03

Reveals critical safety vulnerabilities in prominent LLMs.

Abstract

Recent advances in AI agents capable of solving complex, everyday tasks, from scheduling to customer service, have enabled deployment in real-world settings, but their possibilities for unsafe behavior demands rigorous evaluation. While prior benchmarks have attempted to assess agent safety, most fall short by relying on simulated environments, narrow task domains, or unrealistic tool abstractions. We introduce OpenAgentSafety, a comprehensive and modular framework for evaluating agent behavior across eight critical risk categories. Unlike prior work, our framework evaluates agents that interact with real tools, including web browsers, code execution environments, file systems, bash shells, and messaging platforms; and supports over 350 multi-turn, multi-user tasks spanning both benign and adversarial user intents. OpenAgentSafety is designed for extensibility, allowing researchers to…

Peer Reviews

Decision·ICLR 2026 Poster

Reviewer 01Rating 4Confidence 3

Strengths

* The paper tackles a timely problem—evaluating the real-world safety of LLM-based agents—by integrating realistic environments, multiple user intents, and social interactions. * Experiments are conducted across diverse models and risk categories, using four questions (RQ1–RQ4) to guide the analysis. The results are comprehensive.

Weaknesses

* OpenAgentSafety offers a well-engineered and comprehensive benchmark, but its novelty is limited. The framework mainly integrates existing components such as OpenHands and Sotopia, extending prior safety benchmarks rather than introducing new methodologies. Its hybrid evaluation and GPT-based task generation are incremental refinements of known techniques. * The paper does not describe a mechanism for filtering or validating generated tasks (e.g., to remove redundant, ill-posed, or low-quality

Reviewer 02Rating 4Confidence 3

Strengths

- Multi-user setting seems like an interesting setup in principle; however, I have critical concerns about NPC reproducibility, and the way the NPCs initiate chats - I like the three settings, combining user intent and NPC intent - Interesting discussions

Weaknesses

- There is no info on the NPCs in the main paper nor the appendix: could you at least put some of the prompts? - Are NPC settings reproducible? Fixed open-weight model? What is the model? - When do NPC interact? Is it the agent or the NPC that initiate messages? Because if it is the agent, then if it just decides not to use the chat tool, it will not see malicious messages right? - For some of the tasks the evaluation of the outcome seems a bit arbitrary. My biggest concern is whether in all

Reviewer 03Rating 8Confidence 4

Strengths

## Originality The idea of benchmarking agent safety is not novel, however the specific execution of this benchmark is. In particular, I am not aware of another work that has points 1 and 2 from the summary, these being (1) as realistic environments and (2) different kinds of user / NPC intent. ## Quality and Clarity Overall the quality of the paper is high. The benchmark appears to be well thought out and design decisions justified in the writing. The results are good, and the research que

Weaknesses

There are some weaknesses in the paper: 1. The benchmark is excellent, and accordingly I think the paper could be improved by evaluating more models on the benchmark. For example, by evaluating a number of OpenAI, Anthropic, and open source models, and analyzing this data, the reader would have a better idea of the "general" state of agent safety in the world today. 2. The paper would be improved by providing more information to the reader about what types of problems are in the dataset. E.g. p

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsAdversarial Robustness in Machine Learning · Security and Verification in Computing · Explainable Artificial Intelligence (XAI)