V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking

Jikai Chen; Long Chen; Dong Wang; Qinglin Su; Zhixuan Chu; Bingguang Hao; Leilei Gan; Chenyi Zhuang; Jinjie Gu

arXiv:2508.13634 · cs.AI · January 21, 2026

TL;DR

V2P introduces a novel attention calibration method for GUI element localization, combining background suppression and center-peaking to improve accuracy in GUI grounding tasks.

Contribution

The paper proposes V2P, a new approach that enhances GUI element localization by addressing background distraction and center-edge distinction through attention suppression and Gaussian modeling.

Findings

01. Achieves 92.4% and 52.5% accuracy on two benchmarks.

02. Component ablations confirm effectiveness of each V2P element.

03. Demonstrates generalizability and potential for real-world GUI agents.

Abstract

Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding-box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring the processing of background regions lets attention drift away from the desired area, and (2) modeling the target UI element uniformly fails to distinguish its center from its edges, leading to imprecise clicks. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a…
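
The two ideas named in the abstract, suppressing the background and peaking at the element center, can be illustrated with a toy target attention map. The sketch below is not the authors' implementation: the abstract is cut off before it says how V2P builds its labels, so the function name `gaussian_target`, the grid-cell box convention, and the `sigma_scale` parameter are all assumptions made for illustration.

```python
import numpy as np

def gaussian_target(grid_h, grid_w, box, sigma_scale=0.25):
    """Toy target map over a grid of visual patches (illustrative only).

    box = (x0, y0, x1, y1) in grid cells, with x1 and y1 exclusive.
    Cells outside the box get zero weight (the "valley"); cells inside
    follow a 2D Gaussian that is highest at the box center (the "peak").
    sigma_scale is a hypothetical knob tying the spread to the box size.
    """
    x0, y0, x1, y1 = box
    if x1 <= x0 or y1 <= y0:
        raise ValueError("box must cover at least one grid cell")

    # Box center in cell-index coordinates.
    cx = (x0 + x1 - 1) / 2.0
    cy = (y0 + y1 - 1) / 2.0

    # Spread proportional to the element's width and height.
    sx = (x1 - x0) * sigma_scale
    sy = (y1 - y0) * sigma_scale

    ys, xs = np.mgrid[0:grid_h, 0:grid_w]
    g = np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))

    # Background suppression: zero out every cell outside the element.
    inside = np.zeros((grid_h, grid_w), dtype=bool)
    inside[y0:y1, x0:x1] = True
    g = np.where(inside, g, 0.0)

    # Normalize so the map can be compared with an attention distribution.
    return g / g.sum()

# Example: an 8x12 patch grid with an element covering columns 3..7, rows 2..4.
target = gaussian_target(8, 12, (3, 2, 8, 5))
peak = np.unravel_index(target.argmax(), target.shape)  # (row, col) of the peak
```

A map like this could serve as a supervision target for a model's attention weights, for example through a divergence loss; whether V2P does so in this form is not stated in the text available here.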

Code & Models

Models

🤗 inclusionAI/V2P-7B
model · 435 downloads · ♡ 8

Taxonomy

Topics: Context-Aware Activity Recognition Systems · Gaze Tracking and Assistive Technology