Chain-of-Ground: Improving GUI Grounding via Iterative Reasoning and Reference Feedback
Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, Shilong Liu

TL;DR
This paper introduces Chain of Ground (CoG), a training-free iterative reasoning framework for GUI grounding. By refining its localization hypotheses over multiple steps, it improves accuracy and interpretability; it is evaluated on the ScreenSpot Pro benchmark and on TPanel UI, a new dataset meant to measure real-world generalization.
Contribution
It proposes a multi-step, training-free grounding method that uses multimodal large language models for iterative visual reasoning and refinement, improving accuracy and generalization in GUI localization tasks.
Findings
Achieves 68.4% accuracy on ScreenSpot Pro, an improvement of 4.8 points.
Improves over baseline Qwen3 VL 235B by 6.9 points on TPanel UI.
Highlights the potential of structured iterative refinement for grounding.
Abstract
GUI grounding aims to align natural language instructions with precise regions in complex user interfaces. Advanced multimodal large language models show strong ability in visual GUI grounding but still struggle with small or visually similar targets and with ambiguity in real-world layouts. These limitations arise from limited grounding capacity and from underuse of existing reasoning potential. We present Chain of Ground (CoG), a training-free, multi-step grounding framework that uses multimodal large language models for iterative visual reasoning and refinement. Instead of direct prediction, the model progressively reflects and adjusts its hypotheses, leading to more accurate and interpretable localization. Our approach achieves 68.4% accuracy on the ScreenSpot Pro benchmark, an improvement of 4.8 points. To measure real-world generalization, we introduce TPanel UI, a dataset of 420 labeled…
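The iterative refinement described in the abstract can be pictured as a loop in which each step sees the previous hypothesis and returns an adjusted one. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the `Grounder` signature, the `chain_of_ground` function, the step limit, the pixel tolerance, and the `toy_grounder` stand-in for a multimodal model are all hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]  # (x, y) in screen pixels


@dataclass
class Step:
    index: int
    point: Point


# Hypothetical interface: given the instruction and the previous hypothesis
# (None on the first step), return a new predicted click point.
Grounder = Callable[[str, Optional[Point]], Point]


def chain_of_ground(
    instruction: str,
    grounder: Grounder,
    max_steps: int = 3,   # assumed step budget, not taken from the paper
    tol: float = 1.0,     # assumed convergence tolerance in pixels
) -> List[Step]:
    """Run up to max_steps refinement steps and return the full trace."""
    trace: List[Step] = []
    prev: Optional[Point] = None
    for i in range(max_steps):
        point = grounder(instruction, prev)
        trace.append(Step(i, point))
        # Stop early once the hypothesis no longer moves by more than tol.
        if (
            prev is not None
            and abs(point[0] - prev[0]) <= tol
            and abs(point[1] - prev[1]) <= tol
        ):
            break
        prev = point
    return trace


def toy_grounder(instruction: str, prev: Optional[Point]) -> Point:
    """Stand-in for a model call: moves halfway toward a fixed target."""
    target = (100.0, 40.0)
    if prev is None:
        return (60.0, 80.0)  # deliberately rough first guess
    return ((prev[0] + target[0]) / 2, (prev[1] + target[1]) / 2)


if __name__ == "__main__":
    for step in chain_of_ground("click the Save button", toy_grounder):
        print(step.index, step.point)
```

Keeping the whole trace rather than only the final point is one way to expose the intermediate hypotheses, which is the sense in which a multi-step approach can be more interpretable than a single direct prediction.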
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
