Ponder & Press: Advancing Visual GUI Agent towards General Computer Control
Yiqin Wang, Haoji Zhang, Jingqi Tian, Yansong Tang

TL;DR
Ponder & Press introduces a visual-only GUI agent framework that pairs a general-purpose MLLM, which interprets user instructions, with a GUI-specific MLLM, which locates GUI elements, enabling versatile, human-like computer control across diverse environments.
Contribution
The paper presents a novel divide-and-conquer framework that uses multimodal large language models for visual GUI control, outperforming existing models on GUI grounding and interaction benchmarks.
Findings
Outperforms existing models by 22.5% on the ScreenSpot GUI grounding benchmark.
Achieves state-of-the-art performance across web, desktop, and mobile UIs.
Demonstrates versatile, human-like interaction using only visual input.
Abstract
Most existing GUI agents depend on non-vision inputs like HTML source code or accessibility trees, limiting their flexibility across diverse software environments and platforms. Current multimodal large language models (MLLMs), which excel at using vision to ground real-world objects, offer a potential alternative. However, they often struggle with accurately localizing GUI elements -- a critical requirement for effective GUI automation -- due to the semantic gap between real-world objects and GUI elements. In this work, we introduce Ponder & Press, a divide-and-conquer framework for general computer control using only visual input. Our approach combines a general-purpose MLLM as an 'interpreter', responsible for translating high-level user instructions into detailed action descriptions, with a GUI-specific MLLM as a 'locator' that precisely locates GUI elements for action…
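The interpreter/locator split described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `ActionDescription`, `GroundedAction`, `ponder_and_press`, `stub_interpreter`, and `stub_locator` are hypothetical, and the two stubs stand in for calls to a general-purpose MLLM and a GUI-specific MLLM, whose real interfaces are not given in the source.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class ActionDescription:
    """Interpreter output (hypothetical schema): what to do, and to which element."""
    action: str     # e.g. "click" or "type"
    target: str     # natural-language description of the GUI element
    text: str = ""  # text to enter, used only by "type" actions


@dataclass
class GroundedAction:
    """Final action with pixel coordinates supplied by the locator."""
    action: str
    x: int
    y: int
    text: str = ""


# Both stages see only the screenshot (raw image bytes), never HTML or
# an accessibility tree, matching the visual-only setting in the abstract.
Interpreter = Callable[[str, bytes], ActionDescription]
Locator = Callable[[str, bytes], Tuple[int, int]]


def ponder_and_press(instruction: str, screenshot: bytes,
                     interpreter: Interpreter, locator: Locator) -> GroundedAction:
    # Stage 1 ("ponder"): turn the high-level instruction into a
    # detailed action description.
    description = interpreter(instruction, screenshot)
    # Stage 2 ("press"): ground the described element to coordinates.
    x, y = locator(description.target, screenshot)
    return GroundedAction(description.action, x, y, description.text)


# Stub models for illustration only; real versions would query MLLMs.
def stub_interpreter(instruction: str, screenshot: bytes) -> ActionDescription:
    return ActionDescription(action="click", target="the 'Submit' button")


def stub_locator(target: str, screenshot: bytes) -> Tuple[int, int]:
    return (640, 480)


result = ponder_and_press("Submit the form", b"", stub_interpreter, stub_locator)
```

Keeping the two stages behind separate callables reflects the divide-and-conquer idea: either model could be swapped out without changing the other, under the assumption that they communicate only through a textual element description.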
Taxonomy
Topics: Robotic Path Planning Algorithms · Robotics and Automated Systems
