Improved GUI Grounding via Iterative Narrowing
Anthony Nguyen

TL;DR
This paper presents an iterative narrowing visual prompting framework that significantly improves GUI grounding performance of vision-language models across diverse UI platforms.
Contribution
We introduce a novel visual prompting method with iterative narrowing to enhance GUI grounding in both general and fine-tuned models.
Findings
Improved GUI grounding accuracy across multiple UI platforms.
Effective enhancement for both general and fine-tuned vision-language models.
Open-source code for reproducibility.
Abstract
Graphical User Interface (GUI) grounding plays a crucial role in enhancing the capabilities of Vision-Language Model (VLM) agents. While general VLMs, such as GPT-4V, demonstrate strong performance across various tasks, their proficiency in GUI grounding remains suboptimal. Recent studies have focused on fine-tuning these models specifically for zero-shot GUI grounding, yielding significant improvements over baseline performance. We introduce a visual prompting framework that employs an iterative narrowing mechanism to further improve the performance of both general and fine-tuned models in GUI grounding. We evaluate our method on a comprehensive benchmark spanning various UI platforms and provide the code to reproduce our results.
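The abstract does not spell out the narrowing procedure, so the following is only a minimal sketch of one plausible reading: ask the model for a target location, crop a smaller window around that prediction, and ask again on the crop. Everything named here is an assumption for illustration, not the paper's implementation: `predict` is a hypothetical callable standing in for a VLM query that returns a normalized (x, y) position inside the given crop, and `iterations` and `shrink` are hypothetical parameters. The paper's actual crop sizing and prompting may differ.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels
Point = Tuple[float, float]              # (x, y) in pixels


def narrow_box(box: Box, center: Point, shrink: float, bounds: Box) -> Box:
    """Return a window `shrink` times the size of `box`, centered on
    `center` where possible and shifted to stay inside `bounds`."""
    left, top, right, bottom = box
    width = (right - left) * shrink
    height = (bottom - top) * shrink
    b_left, b_top, b_right, b_bottom = bounds
    # Clamp the top-left corner so the window never leaves the screenshot.
    new_left = min(max(center[0] - width / 2, b_left), b_right - width)
    new_top = min(max(center[1] - height / 2, b_top), b_bottom - height)
    return (new_left, new_top, new_left + width, new_top + height)


def iterative_narrowing(
    predict: Callable[[Box], Point],  # hypothetical stand-in for a VLM call
    width: int,
    height: int,
    iterations: int = 3,   # assumed value, not taken from the paper
    shrink: float = 0.5,   # assumed value, not taken from the paper
) -> Point:
    """Refine a predicted target location by repeatedly cropping around it."""
    full: Box = (0.0, 0.0, float(width), float(height))
    box = full
    point: Point = (width / 2, height / 2)
    for step in range(iterations):
        # The predictor answers in coordinates normalized to the current crop.
        nx, ny = predict(box)
        # Map the answer back to full-screenshot pixel coordinates.
        point = (
            box[0] + nx * (box[2] - box[0]),
            box[1] + ny * (box[3] - box[1]),
        )
        if step < iterations - 1:
            box = narrow_box(box, point, shrink, full)
    return point
```

In practice `predict` would crop the screenshot to `box`, send the crop and the instruction to the model, and parse the returned coordinates. Keeping it as a callable leaves the sketch independent of any particular model or image library.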
Taxonomy
Topics: IoT-based Smart Home Systems · Robotics and Sensor-Based Localization · Advanced Vision and Imaging
