AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

Yunhao Feng; Yifan Ding; Yingshui Tan; Xingjun Ma; Yige Li; Yutao Wu; Yifeng Gao; Kun Zhai; Yanming Guo

arXiv:2604.02947 · cs.AI · April 6, 2026

TL;DR

AgentHazard is a benchmark for evaluating harmful behavior in computer-use agents. It shows that current agents remain vulnerable despite model alignment efforts.

Contribution

This paper introduces AgentHazard, a new benchmark with 2,653 instances to assess safety risks in autonomous agents across diverse attack strategies and operational scenarios.

Findings

01. Current systems remain highly vulnerable: Claude Code powered by Qwen3-Coder shows a 73.63% attack success rate.
02. Agents often fail to recognize and interrupt harmful behavior that emerges from complex, multi-step sequences.
03. Model alignment alone does not ensure safety in autonomous agent systems.
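
As an illustration of how a figure like the attack success rate above is typically read, here is a minimal sketch. It assumes the rate is the percentage of benchmark instances on which the agent was judged to have carried out the harmful objective; the paper's exact scoring procedure is not given on this page. The `Instance` record, its field names, and the `attack_succeeded` verdict are hypothetical and do not reflect the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    """Hypothetical record; field names are illustrative, not the dataset's schema."""
    objective: str          # placeholder for the end goal the step sequence works toward
    steps: List[str]        # operational steps that each look plausible on their own
    attack_succeeded: bool  # assumed judge verdict for one agent run


def attack_success_rate(instances: List[Instance]) -> float:
    """Return the percentage of instances judged as successful attacks."""
    if not instances:
        raise ValueError("need at least one instance")
    successes = sum(1 for inst in instances if inst.attack_succeeded)
    return 100.0 * successes / len(instances)


# Toy data only: four placeholder instances, three judged successful.
toy = [
    Instance("objective-a", ["step 1", "step 2"], True),
    Instance("objective-b", ["step 1", "step 2", "step 3"], True),
    Instance("objective-c", ["step 1"], False),
    Instance("objective-d", ["step 1", "step 2"], True),
]
print(attack_success_rate(toy))
```

Under this reading, the rate is computed per agent and model pairing, so a lower value indicates an agent that more often recognized and interrupted the step sequence.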

Abstract

Computer-use agents extend language models from text generation to persistent action over tools, files, and execution environments. Unlike chat systems, they maintain state across interactions and translate intermediate outputs into concrete actions. This creates a distinct safety challenge in that harmful behavior may emerge through sequences of individually plausible steps, including intermediate actions that appear locally acceptable but collectively lead to unauthorized actions. We present AgentHazard, a benchmark for evaluating harmful behavior in computer-use agents. AgentHazard contains 2,653 instances spanning diverse risk categories and attack strategies. Each instance pairs a harmful objective with a sequence of operational steps that are locally legitimate but jointly induce unsafe behavior. The benchmark evaluates whether agents can recognize and interrupt…

Code & Models

Datasets

Yunhao-Feng/AgentHazard
dataset · 334 downloads