OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments
Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, and Shiyu Huang

TL;DR
OmniGUI is a benchmark for GUI agents in smartphone environments where acting correctly requires processing images, audio, and video together, addressing the limitations of benchmarks built on static screenshots.
Contribution
This work presents the first step-level, multimodal GUI benchmark with a dataset, evaluation pipeline, and initial baselines for omni-modal smartphone environments.
Findings
Current models perform well on static visual tasks but struggle with temporal and audio cues.
Action prediction degrades significantly in environments with synchronous multimodal signals.
Cross-modal interference from environmental noise hampers agent performance.
Abstract
Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively…
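To make the step-level structure described in the abstract concrete, here is a minimal sketch of how one episode might be represented in memory. Every class and field name below (`Step`, `Episode`, `dependency_level`, and so on) is a hypothetical placeholder chosen for illustration and is not the dataset's actual schema. Only the totals of 709 episodes and 2,579 action steps come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Step:
    """One action step. All field names are hypothetical, not the real schema."""
    screenshot_path: str            # static image observed at this step
    audio_path: Optional[str]       # synchronous audio, if any
    video_path: Optional[str]       # video clip, if any
    action: str                     # expert-demonstrated action label
    dependency_level: int           # assumed integer code for the annotated level


@dataclass
class Episode:
    """One expert demonstration inside a single application."""
    app: str
    steps: List[Step] = field(default_factory=list)


def mean_steps_per_episode(episodes: List[Episode]) -> float:
    """Average number of action steps per episode (0.0 for an empty list)."""
    if not episodes:
        return 0.0
    return sum(len(e.steps) for e in episodes) / len(episodes)


def steps_by_dependency_level(episodes: List[Episode]) -> Dict[int, int]:
    """Count steps per dependency level across all episodes."""
    counts: Counter = Counter()
    for episode in episodes:
        for step in episode.steps:
            counts[step.dependency_level] += 1
    return dict(counts)


# Totals reported in the abstract.
REPORTED_EPISODES = 709
REPORTED_STEPS = 2579
REPORTED_MEAN_STEPS = REPORTED_STEPS / REPORTED_EPISODES
```

With the reported totals, `REPORTED_MEAN_STEPS` comes to roughly 3.6 action steps per episode, which suggests that the demonstrated tasks are short and that evaluation happens at the granularity of individual steps rather than long trajectories.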
