Improved GUI Grounding via Iterative Narrowing

Anthony Nguyen

arXiv:2411.13591·cs.CV·September 12, 2025

TL;DR

This paper presents an iterative narrowing visual prompting framework that significantly improves GUI grounding performance of vision-language models across diverse UI platforms.

Contribution

We introduce a novel visual prompting method with iterative narrowing to enhance GUI grounding in both general and fine-tuned models.

Findings

01. Improved GUI grounding accuracy across multiple UI platforms.
02. Effective enhancement for both general and fine-tuned vision-language models.
03. Open-source code for reproducibility.

Abstract

Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. For evaluation, we tested our method on a comprehensive benchmark comprising various UI platforms and provided the code to reproduce our results.
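
The abstract names an iterative narrowing mechanism but does not spell out the procedure, so the following is only a minimal sketch of one plausible reading: ask the model for a point, crop a smaller window around that point, ask again inside the crop, and finally map the last prediction back to full-image pixel coordinates. The `predict` callable, the `shrink` factor of 0.5, and the default of three iterations are all hypothetical choices made for illustration, not details taken from the paper or its repository.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom), full-image pixels
Point = Tuple[float, float]              # (x, y) normalized to [0, 1] within a crop


def narrow_box(box: Box, point: Point, full: Box, shrink: float = 0.5) -> Box:
    """Return a smaller box centred on the predicted point, kept inside `full`."""
    left, top, right, bottom = box
    # Convert the crop-relative prediction to full-image pixel coordinates.
    cx = left + point[0] * (right - left)
    cy = top + point[1] * (bottom - top)
    w = (right - left) * shrink
    h = (bottom - top) * shrink
    # Shift the window if it would extend past the image border.
    new_left = min(max(cx - w / 2, full[0]), full[2] - w)
    new_top = min(max(cy - h / 2, full[1]), full[3] - h)
    return (new_left, new_top, new_left + w, new_top + h)


def iterative_narrowing(
    predict: Callable[[Box], Point],  # hypothetical stand-in for a VLM query on a crop
    width: int,
    height: int,
    iterations: int = 3,              # assumed value, not from the paper
    shrink: float = 0.5,              # assumed value, not from the paper
) -> Tuple[float, float]:
    """Refine a predicted location by repeatedly cropping around it."""
    full: Box = (0.0, 0.0, float(width), float(height))
    box = full
    for _ in range(iterations - 1):
        box = narrow_box(box, predict(box), full, shrink)
    # The final prediction is made on the smallest crop and mapped back.
    x, y = predict(box)
    left, top, right, bottom = box
    return (left + x * (right - left), top + y * (bottom - top))
```

In practice `predict` would crop the screenshot to `box`, send the crop and the instruction to a model, and parse a normalized coordinate from the reply. Keeping it as a plain callable lets the coordinate bookkeeping be checked without any model.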

Code & Models

Repositories

ant-8/GUI-Grounding-via-Iterative-Narrowing
PyTorch · Official

Taxonomy

Topics: IoT-based Smart Home Systems · Robotics and Sensor-Based Localization · Advanced Vision and Imaging