Do Phone-Use Agents Respect Your Privacy?

Zhengyang Tang; Ke Ji; Xidong Wang; Zihan Ye; Xinyuan Wang; Yiduo Guo; Ziniu Li; Chenxin Li; Jingyuan Hu; Shunian Chen; Tongxu Luo; Jiaxi Bi; Zeyu Qin; Shaobo Wang; Xin Lai; Pengyuan Lyu; Junyi Li; Can Xu; Chengquan Zhang; Han Hu; Ming Yan; Benyou Wang

arXiv:2604.00986·cs.CR·April 3, 2026

TL;DR

This paper introduces MyPhoneBench, a framework for evaluating whether phone-use agents respect user privacy during benign tasks. It finds that current models often over-disclose personal data even when they complete the task successfully.

Contribution

The paper operationalizes privacy-respecting behavior in mobile agents and provides a verifiable evaluation framework with mock apps and rule-based auditing.

Findings

01. No single model dominates on both privacy and task success.

02. Joint evaluation of task success and privacy reshuffles model rankings.

03. Agents frequently fill optional personal entries, violating data minimization.

Abstract

We study whether phone-use agents respect privacy while completing benign mobile tasks. This question has remained hard to answer because privacy-compliant behavior is not operationalized for phone-use agents, and ordinary apps do not reveal exactly what data agents type into which form entries during execution. To make this question measurable, we introduce MyPhoneBench, a verifiable evaluation framework for privacy behavior in mobile agents. We operationalize privacy-respecting phone use as permissioned access, minimal disclosure, and user-controlled memory through a minimal privacy contract, iMy, and pair it with instrumented mock apps plus rule-based auditing that make unnecessary permission requests, deceptive re-disclosure, and unnecessary form filling observable and reproducible. Across five frontier models on 10 mobile apps and 300 tasks, we find that task success,…
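The abstract describes rule-based auditing over instrumented mock apps that makes unnecessary form filling observable. As a rough illustration of that idea only, the sketch below checks a log of form writes against a per-task list of required and optional entries. It is not MyPhoneBench's actual code or API: `FieldWrite`, `TaskSpec`, `audit_form_filling`, the field names, and the log format are all hypothetical and chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldWrite:
    """One value the agent typed into a form entry (hypothetical log record)."""
    form_field: str
    value: str


@dataclass(frozen=True)
class TaskSpec:
    """Hypothetical per-task rule: which entries the task needs, which are optional."""
    required_fields: frozenset
    optional_fields: frozenset


def audit_form_filling(spec, writes):
    """Return (field, reason) pairs for writes that break a data-minimization rule."""
    violations = []
    for write in writes:
        if not write.value.strip():
            continue  # an empty entry discloses nothing
        if write.form_field in spec.required_fields:
            continue  # the task needs this entry
        if write.form_field in spec.optional_fields:
            violations.append((write.form_field, "optional field filled"))
        else:
            violations.append((write.form_field, "field not part of task"))
    return violations


# Assumed example task: only name and phone are needed to complete it.
spec = TaskSpec(
    required_fields=frozenset({"name", "phone"}),
    optional_fields=frozenset({"birthday", "home_address"}),
)
log = [
    FieldWrite("name", "Alex"),
    FieldWrite("phone", "555-0100"),
    FieldWrite("birthday", "1990-01-01"),
    FieldWrite("home_address", ""),
]
print(audit_form_filling(spec, log))
```

Under these assumptions the audit flags only the `birthday` write: the task can succeed while the agent still discloses more than the task required, which is the kind of gap a joint success-and-privacy evaluation is meant to surface.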

Code & Models

Repositories

FreedomIntelligence/MyPhoneBench (GitHub)
