AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild

Jiazheng Sun; Mingxuan Li; Yingying Zhang; Jiayang Niu; Yachen Wu; Ruihan Jin; Shuyu Lei; Pengrongrui Tan; Zongyu Zhang; Ruoyi Wang; Jiachen Yang; Boyu Yang; Jiacheng Liu; Xin Peng

arXiv:2602.11750 · cs.SE · February 13, 2026

Open Access

TL;DR

AmbiBench introduces a new benchmark for mobile GUI agents that evaluates their ability to interpret and clarify ambiguous instructions through bidirectional interaction, advancing beyond traditional one-shot instruction following.

Contribution

This paper presents AmbiBench, the first benchmark to incorporate a taxonomy of instruction clarity, together with a multi-agent evaluation framework for assessing intent alignment in mobile GUI agents.

Findings

01. State-of-the-art agents struggle with ambiguous instructions.
02. Active clarification significantly improves task success.
03. MUSE correlates strongly with human judgment.

Abstract

Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users often fail to state precise, fully detailed instructions at the outset, and their phrasing is typically ambiguous. Consequently, agents must converge on the user's true intent through active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution and overlooks the agent's alignment capability. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap…

Taxonomy

Topics: Personal Information Management and User Behavior · Advanced Software Engineering Methodologies · Usability and User Interface Design