Beyond Pass or Fail: Multi-Dimensional Benchmarking of Foundation Models for Goal-based Mobile UI Navigation
Dezhi Ran, Mengzhou Wu, Hao Yu, Yuetong Li, Jun Ren, Yuan Cao, Xia Zeng, Haochuan Lu, Zexin Xu, Mengqian Xu, Ting Su, Liangchao Yao, Ting Xiong, Wei Yang, Yuetang Deng, Assaf Marron, David Harel, Tao Xie

TL;DR
This paper introduces Sphinx, a multi-dimensional benchmark for evaluating foundation models in goal-based mobile UI navigation, revealing their limitations and guiding future improvements.
Contribution
It presents Sphinx, a comprehensive toolkit for detailed evaluation of foundation models in real-world mobile UI navigation tasks, addressing limitations of existing benchmarks.
Findings
FMs struggle with goal-based UI navigation tasks.
Existing FMs are deficient in app knowledge and in planning.
Benchmarking reveals specific failure modes of FMs.
Abstract
Recent advances of foundation models (FMs) have made navigating mobile applications (apps) based on high-level goal instructions within reach, with significant industrial applications such as UI testing. While existing benchmarks evaluate FM-based UI navigation using the binary pass/fail metric, they have two major limitations: they cannot reflect the complex nature of mobile UI navigation where FMs may fail for various reasons (e.g., misunderstanding instructions and failed planning), and they lack industrial relevance due to oversimplified tasks that poorly represent real-world scenarios. To address the preceding limitations, we propose Sphinx, a comprehensive benchmark for multi-dimensional evaluation of FMs in industrial settings of UI navigation. Sphinx introduces a specialized toolkit that evaluates five essential FM capabilities, providing detailed insights into failure modes…
Taxonomy
Topics: Interactive and Immersive Displays · Context-Aware Activity Recognition Systems · Augmented Reality Applications
