What's Missing in Screen-to-Action? Towards a UI-in-the-Loop Paradigm for Multimodal GUI Reasoning
Songze Li, Xiaoke Guo, Tianqi Liu, Biao Yi, Zhaoyan Gong, Zhiqiang Liu, Huajun Chen, Wen Zhang

TL;DR
This paper introduces UILoop, a UI-in-the-Loop paradigm in which multimodal large language models reason about GUIs through explicitly identified UI elements, making GUI reasoning and understanding more interpretable. It is supported by a new benchmark dataset.
Contribution
The paper proposes a cyclic Screen-UI-Action paradigm, enabling explicit learning of UI element functions and localization, and introduces a new UI comprehension benchmark with 26K samples.
Findings
UILoop achieves state-of-the-art UI understanding performance.
UILoop outperforms existing methods in GUI reasoning tasks.
The benchmark enables comprehensive evaluation of UI element mastery.
Abstract
Existing Graphical User Interface (GUI) reasoning tasks remain challenging, particularly in UI understanding. Current methods typically rely on direct screen-based decision-making, which lacks interpretability and overlooks a comprehensive understanding of UI elements, ultimately leading to task failure. To enhance understanding of and interaction with UIs, we propose an innovative GUI reasoning paradigm called UI-in-the-Loop (UILoop). Our approach treats the GUI reasoning task as a cyclic Screen-UI elements-Action process. By enabling Multimodal Large Language Models (MLLMs) to explicitly learn the localization, semantic functions, and practical usage of key UI elements, UILoop achieves precise element discovery and performs interpretable reasoning. Furthermore, we introduce a more challenging UI Comprehension task centered on UI elements with three evaluation metrics.…
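The cyclic Screen-UI elements-Action process described in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the names `UIElement`, `Action`, `run_loop`, and the toy screens and callbacks are all hypothetical, and the MLLM is replaced by simple stand-in functions. The point is the control flow: each step first makes the UI elements explicit (location and function) and only then chooses an action, so every decision can be traced back to a named element.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class UIElement:
    """Hypothetical record for one UI element (not from the paper)."""
    name: str
    bbox: Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels
    function: str                    # short description of what the element does


@dataclass
class Action:
    """Hypothetical action record: 'tap' on a named element, or 'done'."""
    kind: str
    target: str


def run_loop(
    screen: str,
    describe_elements: Callable[[str], List[UIElement]],
    choose_action: Callable[[List[UIElement]], Action],
    apply_action: Callable[[str, Action], str],
    max_steps: int = 5,
) -> List[Tuple[List[UIElement], Action]]:
    """Cycle Screen -> UI elements -> Action until 'done' or max_steps."""
    trace = []
    for _ in range(max_steps):
        elements = describe_elements(screen)   # Screen -> UI elements
        action = choose_action(elements)       # UI elements -> Action
        trace.append((elements, action))       # keep an inspectable record
        if action.kind == "done":
            break
        screen = apply_action(screen, action)  # Action -> next Screen
    return trace


# Toy stand-ins (assumptions): screens are strings, the "model" taps the
# first element it sees, and transitions come from a lookup table.
TOY_SCREENS: Dict[str, List[UIElement]] = {
    "home": [UIElement("settings_icon", (0, 0, 48, 48), "opens the settings page")],
    "settings": [UIElement("wifi_toggle", (10, 60, 90, 100), "turns Wi-Fi on")],
    "wifi_on": [],
}
TOY_TRANSITIONS = {
    ("home", "settings_icon"): "settings",
    ("settings", "wifi_toggle"): "wifi_on",
}


def toy_describe(screen: str) -> List[UIElement]:
    return TOY_SCREENS[screen]


def toy_choose(elements: List[UIElement]) -> Action:
    if not elements:
        return Action("done", "")
    return Action("tap", elements[0].name)


def toy_apply(screen: str, action: Action) -> str:
    return TOY_TRANSITIONS[(screen, action.target)]


trace = run_loop("home", toy_describe, toy_choose, toy_apply)
```

In a real system the three callbacks would presumably be backed by an MLLM and a device or emulator; here they are table lookups so that the loop structure itself stays visible.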
