V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, Jinjie Gu

TL;DR
V2P introduces a novel attention calibration method for GUI element localization, combining background suppression and center-peaking to improve accuracy in GUI grounding tasks.
Contribution
The paper proposes V2P, a new approach that enhances GUI element localization by addressing background distraction and center-edge distinction through attention suppression and Gaussian modeling.
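To make the two ideas concrete, here is a minimal sketch of a center-peaked, background-suppressed target over a grid of image patches. It is an illustration under stated assumptions, not the authors' implementation: the function names `center_peaked_target` and `background_mass`, the patch-grid representation, the half-open box convention, and the `sigma_scale` parameter are all hypothetical, and the paper's actual formulation and loss may differ.

```python
import math


def center_peaked_target(grid_w, grid_h, box, sigma_scale=0.25):
    """Build a normalized target over a grid_h x grid_w patch grid.

    box = (x0, y0, x1, y1) in patch units, half-open (assumed convention).
    Patches whose centers fall outside the box get weight 0 (background
    suppressed); patches inside follow an axis-aligned Gaussian that peaks
    at the box center. sigma_scale is a hypothetical knob tying the spread
    to the box size.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)

    target = [[0.0] * grid_w for _ in range(grid_h)]
    for r in range(grid_h):
        for c in range(grid_w):
            px, py = c + 0.5, r + 0.5  # center of patch (r, c)
            if x0 <= px < x1 and y0 <= py < y1:
                dx, dy = (px - cx) / sx, (py - cy) / sy
                target[r][c] = math.exp(-0.5 * (dx * dx + dy * dy))

    total = sum(sum(row) for row in target)
    if total == 0.0:
        raise ValueError("box does not cover any patch center")
    return [[v / total for v in row] for row in target]


def background_mass(attention, box):
    """Fraction of total attention that lies outside the box."""
    x0, y0, x1, y1 = box
    total = 0.0
    inside = 0.0
    for r, row in enumerate(attention):
        for c, v in enumerate(row):
            total += v
            if x0 <= c + 0.5 < x1 and y0 <= r + 0.5 < y1:
                inside += v
    return 1.0 - inside / total


box = (2, 2, 5, 5)
target = center_peaked_target(8, 8, box)
uniform = [[1.0 / 64] * 8 for _ in range(8)]
```

In this sketch, `background_mass(uniform, box)` measures how much of an uncalibrated attention map falls on irrelevant patches, while `background_mass(target, box)` is zero by construction. A training objective could pull a model's attention toward `target`; how V2P does this is not specified in the text above.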
Findings
Achieves 92.4% and 52.5% accuracy on two benchmarks.
Component ablations confirm effectiveness of each V2P element.
Demonstrates generalizability and potential for real-world GUI agents.
Abstract
Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) neglecting to handle background regions lets attention drift away from the desired area, and (2) modeling the target UI element uniformly fails to distinguish between its center and edges, leading to imprecise clicks. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions in order to highlight the intended region. For the issue of center-edge distinction, V2P applies a…
Taxonomy
Topics: Context-Aware Activity Recognition Systems · Gaze Tracking and Assistive Technology
