ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li; Haoyu Luo; Yuejin Xie; Yuqian Fu; Zhonghao Yang; Shuai Shao; Qihan Ren; Wanying Qu; Yanwei Fu; Yujiu Yang; Jing Shao; Xia Hu; and Dongrui Liu

arXiv:2604.02022 · cs.AI · May 14, 2026

TL;DR

ATBench is a trajectory-level benchmark for evaluating the safety of LLM-based agents over multi-step interactions. It targets three gaps in existing benchmarks: insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism.

Contribution

The paper introduces ATBench, a trajectory-level benchmark that organizes agentic risk along three dimensions (risk source, failure mode, and real-world harm) and constructs trajectories from heterogeneous tool pools under a long-context delayed-trigger protocol.

Findings

01

ATBench contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens.

02

Experiments show that ATBench is challenging for current LLMs and guard systems.

03

The benchmark enables detailed analysis of long-horizon failure patterns.

Abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked…
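The three-dimensional taxonomy and the safe/unsafe split described in the abstract can be pictured as a small labeled record. The sketch below is illustrative only: the class name `TrajectoryLabel`, its field names, the placeholder category strings, and the helper `safe_fraction` are assumptions made for this example and are not taken from the ATBench release. Only the counts (503 safe, 497 unsafe) come from the abstract.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class TrajectoryLabel:
    """Hypothetical label record; not the actual ATBench schema."""
    is_safe: bool
    # The three taxonomy dimensions named in the abstract.
    # Left as None for safe trajectories in this sketch (an assumption).
    risk_source: Optional[str] = None
    failure_mode: Optional[str] = None
    real_world_harm: Optional[str] = None


def safe_fraction(labels: Iterable[TrajectoryLabel]) -> float:
    """Return the share of trajectories labeled safe."""
    items = list(labels)
    if not items:
        raise ValueError("no labels given")
    return sum(1 for item in items if item.is_safe) / len(items)


# Rebuild the reported split with placeholder category values.
safe = [TrajectoryLabel(is_safe=True)] * 503
unsafe = [
    TrajectoryLabel(
        is_safe=False,
        risk_source="placeholder_source",    # hypothetical category name
        failure_mode="placeholder_mode",     # hypothetical category name
        real_world_harm="placeholder_harm",  # hypothetical category name
    )
] * 497
labels = safe + unsafe

print(len(labels))                     # 1000
print(f"{safe_fraction(labels):.3f}")  # 0.503
```

With 503 of 1,000 trajectories labeled safe, a classifier that always predicts one class would score close to 50% accuracy, which is why the near-even split matters for a safety benchmark.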
