AmbiBench: Benchmarking Mobile GUI Agents Beyond One-Shot Instructions in the Wild
Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, Jiachen Yang, Boyu Yang, Jiacheng Liu, Xin Peng

TL;DR
AmbiBench introduces a new benchmark for mobile GUI agents that evaluates their ability to interpret and clarify ambiguous instructions through bidirectional interaction, advancing beyond traditional one-shot instruction following.
Contribution
This paper presents AmbiBench, the first benchmark to incorporate a taxonomy of instruction clarity and a multi-agent evaluation framework for assessing intent alignment in mobile GUI agents.
Findings
State-of-the-art agents struggle with ambiguous instructions.
Active clarification significantly improves task success.
MUSE correlates strongly with human judgment.
Abstract
Benchmarks are paramount for gauging progress in the domain of Mobile GUI Agents. In practical scenarios, users frequently fail to articulate precise directives containing full task details at the onset, and their expressions are typically ambiguous. Consequently, agents are required to converge on the user's true intent via active clarification and interaction during execution. However, existing benchmarks predominantly operate under the idealized assumption that user-issued instructions are complete and unequivocal. This paradigm focuses exclusively on assessing single-turn execution while overlooking the alignment capability of the agent. To address this limitation, we introduce AmbiBench, the first benchmark incorporating a taxonomy of instruction clarity to shift evaluation from unidirectional instruction following to bidirectional intent alignment. Grounded in Cognitive Gap…
Taxonomy
Topics: Personal Information Management and User Behavior · Advanced Software Engineering Methodologies · Usability and User Interface Design
