TL;DR
This paper introduces a multi-turn, visual feedback-driven approach for GUI grounding in coding environments, significantly improving precision over single-shot methods through iterative refinement.
Contribution
It presents a novel iterative refinement mechanism for GUI grounding that enhances accuracy and robustness in dense, dynamic coding interfaces.
Findings
Multi-turn refinement outperforms single-shot models in click accuracy.
The approach adapts to dynamic UI changes through visual feedback.
Task success rate improves significantly across multiple benchmarks.
Abstract
Computer Use Agents (CUAs) fundamentally rely on graphical user interface (GUI) grounding to translate language instructions into executable screen actions, but editing-level grounding in dense coding interfaces, where sub-pixel accuracy is required to interact with dense IDE elements, remains underexplored. Existing approaches typically rely on single-shot coordinate prediction, which lacks a mechanism for error correction and often fails in high-density interfaces. In this technical report, we conduct an empirical study of pixel-precise cursor localization in coding environments. Instead of a single-step execution, our agent engages in an iterative refinement process, utilizing visual feedback from previous attempts to reach the target element. This closed-loop grounding mechanism allows the agent to self-correct displacement errors and adapt to dynamic UI changes. We evaluate our…
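The closed-loop idea described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the paper's implementation: `Box`, `observed_offset`, `half_step`, and `refine_cursor` are hypothetical names, the "visual feedback" is replaced by an offset computed directly from a known target (a real agent would have to infer it from a screenshot showing the cursor), and the correction policy simply moves halfway toward the target each turn.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a (hypothetical) target UI element, in pixels."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, p: Point) -> bool:
        x, y = p
        return self.left <= x <= self.right and self.top <= y <= self.bottom

    def center(self) -> Point:
        return ((self.left + self.right) / 2, (self.top + self.bottom) / 2)


def observed_offset(guess: Point, target: Box) -> Point:
    """Stand-in for visual feedback: displacement from the cursor to the target center.

    Assumption: a real agent would estimate this from a screenshot; here it is
    computed from ground truth purely for illustration.
    """
    cx, cy = target.center()
    return (cx - guess[0], cy - guess[1])


def half_step(guess: Point, offset: Point) -> Point:
    """Toy correction policy: move half of the perceived displacement."""
    return (guess[0] + 0.5 * offset[0], guess[1] + 0.5 * offset[1])


def refine_cursor(
    target: Box,
    first_guess: Point,
    correct: Callable[[Point, Point], Point],
    max_turns: int = 5,
) -> Tuple[Point, int, bool]:
    """Iteratively refine a cursor position until it lands on the target.

    Returns (final position, turns used, whether the target was hit).
    With max_turns=1 this degenerates to single-shot prediction.
    """
    guess = first_guess
    for turn in range(1, max_turns + 1):
        if target.contains(guess):
            return guess, turn, True
        if turn < max_turns:
            # Use feedback from the previous attempt to correct the displacement.
            guess = correct(guess, observed_offset(guess, target))
    return guess, max_turns, False


if __name__ == "__main__":
    target = Box(left=100, top=200, right=108, bottom=206)  # a small, dense UI element
    print("multi-turn :", refine_cursor(target, (116.0, 211.0), half_step, max_turns=5))
    print("single-shot:", refine_cursor(target, (116.0, 211.0), half_step, max_turns=1))
```

In this toy setting the single-shot call has no opportunity to correct its initial displacement, while the multi-turn call converges onto the small target after a few corrections. How the actual agent perceives and uses the feedback is described only at a high level in the abstract.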
