SafeAgentBench: A Benchmark for Safe Task Planning of Embodied LLM Agents

Sheng Yin; Xianghe Pang; Yuanzhuo Ding; Menglan Chen; Yutong Bi; Yichen Xiong; Wenhao Huang; Zhen Xiang; Jing Shao; and Siheng Chen

arXiv:2412.13178·cs.CR·November 3, 2025·2 cites

TL;DR

SafeAgentBench is a comprehensive benchmark designed to evaluate safety-aware task planning in embodied LLM agents within interactive environments, addressing critical safety risks overlooked by existing benchmarks.

Contribution

It introduces the first safety-focused benchmark with a diverse dataset, a universal environment, and evaluation methods for embodied LLM agents, highlighting safety challenges in real-world tasks.

Findings

1. Agents show significant variation in task success rates.
2. Overall safety awareness among agents remains weak.
3. Replacing the LLM does not significantly improve safety awareness.
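
The findings above compare agents on task success and on safety awareness. A minimal sketch of how such per-agent rates could be tabulated from episode logs is shown here. The `Episode` record, its field names, and the three metric names are hypothetical and chosen for illustration; they are not taken from the SafeAgentBench code or dataset schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One evaluated task (hypothetical record layout, not the benchmark's schema)."""
    agent: str        # name of the agent or planner under test
    hazardous: bool   # True if the instruction describes a hazardous task
    rejected: bool    # True if the agent refused to plan the task
    succeeded: bool   # True if the agent's plan completed the task


def _rate(count, total):
    """Return count / total, or None when there is nothing to average over."""
    return count / total if total else None


def summarize(episodes):
    """Aggregate per-agent rejection and success rates."""
    by_agent = defaultdict(list)
    for ep in episodes:
        by_agent[ep.agent].append(ep)

    summary = {}
    for agent, eps in by_agent.items():
        hazardous = [e for e in eps if e.hazardous]
        safe = [e for e in eps if not e.hazardous]
        summary[agent] = {
            # Share of hazardous tasks the agent declined (higher is safer).
            "rejection_rate": _rate(sum(e.rejected for e in hazardous), len(hazardous)),
            # Share of hazardous tasks the agent carried out (lower is safer).
            "hazard_success_rate": _rate(sum(e.succeeded for e in hazardous), len(hazardous)),
            # Share of safe tasks the agent completed (planning ability).
            "safe_success_rate": _rate(sum(e.succeeded for e in safe), len(safe)),
        }
    return summary
```

Keeping hazardous and safe tasks in separate buckets means a low hazardous-task success rate can be told apart from weak planning in general: an agent that fails everything is not thereby safety-aware.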

Abstract

With the integration of large language models (LLMs), embodied agents have strong capabilities to understand and plan complicated natural language instructions. However, a foreseeable issue is that those embodied agents can also flawlessly execute some hazardous tasks, potentially causing damage in the real world. Existing benchmarks predominantly overlook critical safety risks, focusing solely on planning performance, while a few evaluate LLMs' safety awareness only on non-interactive image-text data. To address this gap, we present SafeAgentBench -- the first comprehensive benchmark for safety-aware task planning of embodied LLM agents in interactive simulation environments, covering both explicit and implicit hazards. SafeAgentBench includes: (1) an executable, diverse, and high-quality dataset of 750 tasks, rigorously curated to cover 10 potential hazards and 3 task types; (2)…

Peer Reviews

Decision: Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The paper addresses a highly relevant and timely topic, focusing on the safety of embodied AI systems at a time when LLM-based robotic task planning is rapidly expanding.
2. The benchmark fills an existing gap by shifting the focus from task completion to evaluating how agents respond to hazardous instructions.
3. The work provides a useful starting point for further research on safety evaluation and benchmarking for embodied LLM agents.
4. The design of SafeAgentBench is well thought out, in

Weaknesses

1. The overall contribution feels incremental since the paper focuses mainly on dataset and benchmark construction rather than proposing new algorithms or methods that enhance safety.
2. SafeAgentEnv adds limited novelty because it is largely an adaptation of AI2-THOR with only minor extensions.
3. The evaluation is conducted entirely in simulation, and the paper does not discuss how the findings would transfer to real-world or physical robot scenarios.
4. Main reliance on GPT-4 for both datas

Reviewer 02 · Rating 2 · Confidence 5

Strengths

The paper addresses the critical and timely problem of embodied agent safety. As agents become more capable, understanding their failure modes and safety awareness is of paramount importance to the field.

Weaknesses

## Major Weaknesses:

**1. Critically Outdated and Irrelevant Model Selection:** The paper's experimental setup is fundamentally flawed by its exclusive reliance on pure text-based LLMs.

- **Lack of Multimodality:** Embodied agents, by definition, must perceive and interact with their environment. This requires processing multimodal inputs, primarily visual information (e.g., images, depth maps). The paper’s core evaluation, however, uses GPT-4 (a text-only model) as the central planner, with

Reviewer 03 · Rating 2 · Confidence 4

Strengths

- The paper aims to provide a benchmark and evaluation for robot practitioners to build upon by ensuring VLM-based planners plan safely before executing low-level control policies.
- If you are developing methods that assume an LLM/VLM agent generating high-level task plans, this work can be useful.
- The plots are well formatted and clear to read.

Weaknesses

- Narrow Focus on LLM Agents for Embodied Task Reasoning
  - The paper is focused on the use of LLM agents for embodied task reasoning. While this could be useful in some contexts where developers may choose to use an LLM for reasoning, it’s unclear why one would be motivated to do so currently.
  - We’ve seen lots of developments around VLM agents. Why not expand to VLM agents as well? It seems like some methods rely on VLMs, in which case the authors should make this more explicit.
- B

Code & Models

Repositories

shengyin1224/safeagentbench (Official)

Datasets

safeagentbench/SafeAgentBench
Dataset · 309 downloads

Taxonomy

Topics: Multi-Agent Systems and Negotiation · Reinforcement Learning in Robotics · Advanced Malware Detection Techniques