Agent-SafetyBench: Evaluating the Safety of LLM Agents

Zhexin Zhang; Shiyao Cui; Yida Lu; Jingzhuo Zhou; Junxiao Yang; Hongning Wang; Minlie Huang

arXiv:2412.14470 · cs.CL · May 21, 2025 · 2 citations

TL;DR

This paper introduces Agent-SafetyBench, a comprehensive benchmark for evaluating the safety of LLM agents across various failure modes, revealing significant safety gaps and the need for more robust safety strategies.

Contribution

The paper presents a new benchmark with 349 environments and 2,000 test cases to evaluate LLM agent safety, highlighting current safety deficiencies and guiding future improvements.

Findings

01

None of the 16 evaluated agents achieved a safety score above 60%.

02

Current safety issues include a lack of robustness and a lack of risk awareness.

03

Defense prompts alone are insufficient for ensuring safety.

Abstract

As large language models (LLMs) are increasingly deployed as agents, their integration into interactive environments and tool use introduce new safety challenges beyond those associated with the models themselves. However, the absence of comprehensive benchmarks for evaluating agent safety presents a significant barrier to effective assessment and further improvement. In this paper, we introduce Agent-SafetyBench, a comprehensive benchmark designed to evaluate the safety of LLM agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%. This highlights significant safety challenges in LLM agents and underscores the…
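The headline result is stated as a safety score with a 60% ceiling across agents. As a minimal sketch, assuming the score is the percentage of test cases whose interaction is judged safe (the abstract excerpt here does not spell out the formula), it could be computed as below. The `CaseResult` type, its fields, and the toy verdicts are hypothetical and are not taken from the benchmark's code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CaseResult:
    """Hypothetical record of one test case; not the benchmark's real schema."""
    case_id: int
    is_safe: bool  # verdict from some judge (assumed to be given)


def safety_score(results: List[CaseResult]) -> float:
    """Return the percentage of test cases judged safe (assumed definition)."""
    if not results:
        raise ValueError("safety_score needs at least one result")
    safe = sum(1 for r in results if r.is_safe)
    return 100.0 * safe / len(results)


# Toy data: 10 cases, every even-numbered case marked safe (made up).
results = [CaseResult(case_id=i, is_safe=(i % 2 == 0)) for i in range(10)]
score = safety_score(results)
print(f"safety score: {score:.1f}%")
print("above 60% threshold:", score > 60.0)
```

Under this assumed definition, an agent evaluated on all 2,000 test cases would need more than 1,200 safe verdicts to exceed the 60% mark.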

Code & Models

Repositories

thu-coai/agent-safetybench
Official

Models

thu-coai/ShieldAgent
model · 798 downloads · 2 likes

Datasets

thu-coai/Agent-SafetyBench
dataset · 118 downloads

Taxonomy

Topics: Safety Systems Engineering in Autonomy