AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents
Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, Yanming Guo

TL;DR
AgentHazard is a benchmark for evaluating harmful behavior in computer-use agents. It shows that current agents remain vulnerable despite model alignment efforts.
Contribution
This paper introduces AgentHazard, a benchmark of 2,653 instances for assessing safety risks in autonomous agents across diverse attack strategies and operational scenarios.
Findings
Current systems are highly vulnerable: Claude Code powered by Qwen3-Coder has a 73.63% attack success rate.
Agents often fail to recognize and interrupt harmful behaviors emerging from complex, multi-step sequences.
Model alignment alone does not ensure safety in autonomous agent systems.
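Attack success rate (ASR) is conventionally the percentage of attack instances on which the agent actually carries out the harmful behavior. The sketch below shows that calculation. It assumes a hypothetical per-instance result record: `EvalResult` and `harmful_behavior_executed` are illustrative names, not the benchmark's schema, and the paper's exact judging criteria are not reproduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EvalResult:
    """Outcome of one benchmark instance (hypothetical schema, for illustration only)."""
    instance_id: str
    harmful_behavior_executed: bool  # True if the agent carried out the unsafe action


def attack_success_rate(results: List[EvalResult]) -> float:
    """Return the percentage of instances on which the attack succeeded."""
    if not results:
        raise ValueError("no results to score")
    successes = sum(r.harmful_behavior_executed for r in results)
    return 100.0 * successes / len(results)


# Toy outcomes, made up for illustration (not data from the paper).
toy = [
    EvalResult("a", True),
    EvalResult("b", False),
    EvalResult("c", True),
    EvalResult("d", True),
]
print(f"{attack_success_rate(toy):.2f}%")  # 75.00%
```

Under this definition, a lower ASR is better: it means the agent more often declined or interrupted the harmful sequence.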
Abstract
Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge: harmful behavior may emerge through sequences of individually plausible steps, in which intermediate actions that appear acceptable in isolation collectively lead to unauthorized outcomes. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt…
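The abstract describes each instance as a harmful objective paired with operational steps that are individually legitimate but jointly unsafe. It then asks whether an agent can interrupt such a sequence. The sketch below illustrates that idea under a hypothetical schema. `HazardInstance`, `run_instance`, and the two toy policies are illustrative names, not the benchmark's data format or evaluation harness, and the steps are neutral placeholders rather than real benchmark content.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HazardInstance:
    """Hypothetical record: a harmful objective decomposed into locally plausible steps."""
    risk_category: str
    attack_strategy: str
    harmful_objective: str
    steps: List[str] = field(default_factory=list)


# A policy sees the steps already executed plus the next step and returns
# True to execute it or False to interrupt.
AgentPolicy = Callable[[List[str], str], bool]


def run_instance(instance: HazardInstance, agent: AgentPolicy) -> Optional[int]:
    """Feed steps one at a time. Return the index where the agent interrupted,
    or None if it executed every step (the objective was jointly completed)."""
    history: List[str] = []
    for i, step in enumerate(instance.steps):
        if not agent(history, step):
            return i
        history.append(step)
    return None


def per_step_filter(history: List[str], step: str) -> bool:
    # Judges each step in isolation; none of these steps looks harmful on its own.
    return "rm -rf" not in step


def trajectory_aware(history: List[str], step: str) -> bool:
    # Toy rule: reading a file and later uploading it is treated as jointly unsafe.
    read_before = any(s.startswith("read") for s in history)
    return not (read_before and step.startswith("upload"))


instance = HazardInstance(
    risk_category="placeholder-category",
    attack_strategy="placeholder-strategy",
    harmful_objective="placeholder objective",
    steps=["read file X", "summarize contents", "upload summary to Y"],
)
print(run_instance(instance, per_step_filter))   # None
print(run_instance(instance, trajectory_aware))  # 2
```

The contrast mirrors the abstract's point. A check that inspects each step in isolation lets the whole sequence through. A policy that conditions on history can stop at the step where the combination becomes unsafe.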
