Safe, or Simply Incapable? Rethinking Safety Evaluation for Phone-Use Agents

Zhengyang Tang; Yi Zhang; Chenxin Li; Xin Lai; Pengyuan Lyu; Yiduo Guo; Weinong Wang; Junyi Li; Yang Ding; Huawen Shen; Zhengyao Fang; Xingran Zhou; Liang Wu; Fei Tang; Sunqi Fan; Shangpin Peng; Zheng Ruan; Anran Zhang; Benyou Wang; Chengquan Zhang; Han Hu

arXiv:2605.07630·cs.CL·May 11, 2026

TL;DR

This paper introduces PhoneSafety, a benchmark for evaluating phone-use agents that distinguishes between unsafe actions and inability to act, revealing that better app performance does not always mean safer behavior.

Contribution

The paper presents PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. It separates unsafe decisions from inability to act, so the two failure modes can be measured independently.

Findings

01

Stronger app performance does not guarantee safer decisions.

02

Failures to act are linked to visual and operational complexity.

03

Unsafe choices and inability to act are distinct failure modes.

Abstract

When a phone-use agent avoids harm, does that show safety, or simply inability to act? Existing evaluations often cannot tell. A harmful outcome may be avoided because the agent recognized the risk and chose the safe action, or because it failed to understand the screen or execute any relevant action at all. These cases have different causes and call for different fixes, yet current benchmarks often merge them under task success, refusal, or final harmful outcome. We address this problem with PhoneSafety, a benchmark of 700 safety-critical moments drawn from real phone interactions across more than 130 apps. Each instance isolates the next decision at a risky moment and asks a simple question: does the model take the safe action, take the unsafe action, or fail to do anything useful? We evaluate eight representative phone-use agents under this framework. Our results reveal two main…
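The three-way question posed in the abstract (safe action, unsafe action, or nothing useful) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Outcome`, `outcome_rates`, and `harm_avoided_rate` are hypothetical, the two agents and their outcomes are invented, and none of it reflects the benchmark's actual scoring code or results. The sketch only shows why a merged "harm avoided" number cannot separate a safe agent from an incapable one.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """Hypothetical labels for the three per-instance outcomes."""
    SAFE = "safe"            # the model took the safe action
    UNSAFE = "unsafe"        # the model took the unsafe action
    INCAPABLE = "incapable"  # the model failed to do anything useful


def outcome_rates(outcomes):
    """Return the fraction of instances that fall in each category."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    counts = Counter(outcomes)
    total = len(outcomes)
    return {o: counts[o] / total for o in Outcome}


def harm_avoided_rate(outcomes):
    """Merged metric: every non-unsafe outcome counts as harm avoided."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    avoided = sum(1 for o in outcomes if o is not Outcome.UNSAFE)
    return avoided / len(outcomes)


# Two invented agents (illustrative data only, not results from the paper).
agent_a = [Outcome.SAFE] * 6 + [Outcome.UNSAFE] * 2 + [Outcome.INCAPABLE] * 2
agent_b = [Outcome.SAFE] * 2 + [Outcome.UNSAFE] * 2 + [Outcome.INCAPABLE] * 6

if __name__ == "__main__":
    for name, outcomes in (("agent_a", agent_a), ("agent_b", agent_b)):
        rates = outcome_rates(outcomes)
        print(
            name,
            "harm avoided:", harm_avoided_rate(outcomes),
            "safe:", rates[Outcome.SAFE],
            "incapable:", rates[Outcome.INCAPABLE],
        )
```

In this toy data both agents score the same on the merged metric, yet one mostly chooses the safe action while the other mostly fails to act. Reporting the three rates separately makes that difference visible.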
