Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents
Miles Q. Li, Benjamin C. M. Fung, Boyang Li, Heba Ismail, Farkhund Iqbal

TL;DR
This paper systematically analyzes 40 safety benchmarks for AI agents, revealing methodological inconsistencies, coverage gaps, and the impact of evaluation choices on safety conclusions.
Contribution
It introduces a six-axis taxonomy for benchmark evaluation, catalogs existing benchmarks, and proposes standards to improve consistency and coverage in AI safety assessment.
Findings
Benchmark choice can lead to contradictory safety conclusions.
Coverage counts often overstate evaluation depth.
Environment fidelity influences safety reporting.
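As a rough illustration of the second finding, the sketch below contrasts a plain coverage count (how many benchmarks touch a risk at all) with a depth-aware count (how many include at least some minimum number of test cases for it). Everything here is assumed for illustration: the benchmark names, risk categories, case counts, and the `min_cases` threshold are hypothetical and are not taken from the paper's corpus or taxonomy.

```python
# Hypothetical data: benchmark names, risk categories, and test-case counts
# are invented for illustration and do not come from the paper.
cases = {
    "bench_a": {"prompt_injection": 120, "data_leakage": 3},
    "bench_b": {"prompt_injection": 45, "data_leakage": 2, "unsafe_tool_use": 1},
    "bench_c": {"unsafe_tool_use": 200},
}


def coverage_count(cases, risk):
    """Number of benchmarks with at least one test case for `risk`."""
    return sum(1 for per_risk in cases.values() if per_risk.get(risk, 0) > 0)


def deep_coverage_count(cases, risk, min_cases):
    """Number of benchmarks with at least `min_cases` test cases for `risk`."""
    return sum(1 for per_risk in cases.values() if per_risk.get(risk, 0) >= min_cases)


# Collect every risk category mentioned by any benchmark.
risks = sorted({risk for per_risk in cases.values() for risk in per_risk})

for risk in risks:
    shallow = coverage_count(cases, risk)
    deep = deep_coverage_count(cases, risk, min_cases=30)
    print(f"{risk}: covered by {shallow}, covered in depth by {deep}")
```

In this toy data every risk is "covered" by two benchmarks, yet `data_leakage` has no benchmark meeting the depth threshold, which is the kind of gap a bare coverage count can hide.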
Abstract
The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core…
