Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements
Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, Jun Luo

TL;DR
ZoomUI is a novel method that enables GUI element grounding in multimodal language models without training, by iteratively focusing on interface regions to understand instructions better.
Contribution
It introduces a training-free inference scaling approach that decomposes complex UI understanding into visual element recognition using common MLLMs.
Findings
ZoomUI achieves state-of-the-art performance on multiple benchmarks.
It reduces reliance on large annotated datasets for GUI grounding.
ZoomUI surpasses existing fine-tuned models in accuracy.
Abstract
Multimodal Large Language Model (MLLM)-based Graphical User Interface (GUI) agents are developing rapidly, with visual grounding, which maps natural language instructions to target UI elements, serving as the core capability. Existing GUI agents typically fine-tune MLLMs on massive datasets to handle challenges in understanding instructions and user interfaces, which not only incurs high data annotation costs but also makes performance dependent on data quality and distribution. To avoid such cumbersome yet ineffective training, we observe that complex user interfaces can be decomposed into basic visual elements directly understandable by common MLLMs. Consequently, we propose ZoomUI, which leverages inference scaling to guide common MLLMs in progressively anchoring instruction elements to increasingly detailed interface elements. Specifically, ZoomUI first optimizes the latent thinking to transform…
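The progressive anchoring described in the abstract can be pictured as a coarse-to-fine loop over screen regions. The sketch below is only an illustration of that general idea, not the authors' implementation: the function `zoom_ground`, the `propose` callback (a hypothetical stand-in for an MLLM call that returns a normalized sub-box of the current region), and the fixed `steps` count are all assumptions introduced here.

```python
from typing import Callable, Tuple

# (left, top, right, bottom); pixels for regions, [0, 1] for normalized sub-boxes
Box = Tuple[float, float, float, float]


def zoom_ground(
    screen: Box,
    propose: Callable[[Box], Box],
    steps: int = 3,
) -> Tuple[float, float]:
    """Narrow a region step by step and return its center in screen pixels.

    `propose` is a hypothetical placeholder for a model call: given the
    current region, it returns a normalized sub-box relative to that region.
    """
    left, top, right, bottom = screen
    for _ in range(steps):
        nl, nt, nr, nb = propose((left, top, right, bottom))
        width, height = right - left, bottom - top
        # Map the normalized sub-box back into full-screen pixel coordinates.
        left, top, right, bottom = (
            left + nl * width,
            top + nt * height,
            left + nr * width,
            top + nb * height,
        )
    # The final click point is the center of the last, most detailed region.
    return ((left + right) / 2, (top + bottom) / 2)
```

For example, a `propose` that always picks the bottom-right quadrant, `lambda region: (0.5, 0.5, 1.0, 1.0)`, applied twice to a 1000 by 800 screen narrows the region to (750, 600, 1000, 800). In a real system the callback would crop the screenshot to the current region and query the model with the instruction.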
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
