R-VLM: Region-Aware Vision Language Model for Precise GUI Grounding

Joonhyung Park; Peng Tang; Sagnik Das; Srikar Appalaraju; Kunwar Yashraj Singh; R. Manmatha; Shabnam Ghadar

arXiv:2507.05673 · cs.CV · July 9, 2025

TL;DR

R-VLM introduces a region-aware vision language model with IoU-aware training for precise GUI element grounding, significantly improving accuracy over existing methods across multiple benchmarks.

Contribution

The paper presents a novel approach that combines zoomed-in region proposals with an IoU-aware loss to improve GUI grounding accuracy in vision language models.

Findings

01

Achieves 13% higher grounding accuracy than existing methods on GUI grounding benchmarks.

02

Improves GUI navigation task accuracy by up to 9.7%.

03

Bridges vision language models with object detection techniques.

Abstract

Visual agent models for automating human activities on Graphical User Interfaces (GUIs) have emerged as a promising research direction, driven by advances in large Vision Language Models (VLMs). A critical challenge in GUI automation is the precise grounding of interface elements across diverse platforms. Existing vision-only GUI agents directly ground elements from large and cluttered screenshots, requiring them to process substantial irrelevant information that compromises their accuracy. In addition, these approaches typically employ basic cross-entropy loss for learning grounding objectives, which fails to effectively capture grounding quality compared to established object detection metrics like Intersection-over-Union (IoU). To address these issues, we introduce R-VLM, a novel GUI grounding approach that leverages zoomed-in region proposals for precise element localization. We…
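The abstract contrasts cross-entropy loss with Intersection-over-Union (IoU) as a measure of grounding quality. As a point of reference, here is a minimal sketch of the standard IoU metric for two axis-aligned boxes. It is not the paper's IoU-aware objective, which the truncated abstract does not specify; the `(x1, y1, x2, y2)` box format and the names `Box` and `iou` are assumptions made for illustration.

```python
from typing import Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Return the Intersection-over-Union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    target = (100.0, 100.0, 200.0, 140.0)       # hypothetical ground-truth element
    near_miss = (110.0, 100.0, 210.0, 140.0)    # shifted 10 px to the right
    far_miss = (300.0, 300.0, 400.0, 340.0)     # no overlap at all
    print(iou(target, near_miss), iou(target, far_miss))
```

Unlike a token-level cross-entropy over coordinate text, this score varies smoothly with how far a predicted box drifts from the target, which is the property the abstract appeals to when it cites IoU as an established detection metric.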

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI