V2P: Visual Attention Calibration for GUI Grounding via Background Suppression and Center Peaking
Jikai Chen, Long Chen, Dong Wang, Qinglin Su, Zhixuan Chu, Bingguang Hao, Leilei Gan, Chenyi Zhuang, Jinjie Gu

TL;DR
V2P introduces a novel attention calibration method for GUI element localization, using background suppression and Gaussian heatmaps to improve precision and robustness in GUI grounding tasks.
Contribution
The paper proposes V2P, a new approach that addresses background distraction and center-edge distinction in GUI grounding through suppression attention and Gaussian modeling.
Findings
Achieves 92.4% and 52.5% accuracy on two benchmarks, ScreenSpot-v2 and ScreenSpot-Pro.
Effectively isolates target areas and improves click precision.
Demonstrates generalizability and potential for real-world deployment.
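For context on what the accuracy figures above measure, here is a minimal sketch of a click-accuracy metric. It assumes the common GUI grounding convention that a prediction counts as correct when the predicted click point falls inside the ground-truth bounding box. The function name `click_accuracy` and the sample data are hypothetical, and the paper's exact evaluation protocol is not given in this summary.

```python
def click_accuracy(clicks, boxes):
    """Fraction of predicted (x, y) clicks that land inside their target box.

    clicks: list of (x, y) points; boxes: list of (x0, y0, x1, y1) boxes.
    Assumed scoring rule (not confirmed by the source): point-in-box.
    """
    if not clicks:
        return 0.0
    hits = sum(
        x0 <= x <= x1 and y0 <= y <= y1
        for (x, y), (x0, y0, x1, y1) in zip(clicks, boxes)
    )
    return hits / len(clicks)


# One click inside its box, one outside.
score = click_accuracy([(10, 10), (50, 50)], [(0, 0, 20, 20), (0, 0, 20, 20)])
```

Under this rule a click near the edge of an element scores the same as one at its center, so the metric rewards landing inside the element rather than how central the click is.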
Abstract
Precise localization of GUI elements is crucial for the development of GUI agents. Traditional methods rely on bounding box or center-point regression, neglecting spatial interaction uncertainty and visual-semantic hierarchies. Recent methods incorporate attention mechanisms but still face two key issues: (1) ignoring background regions lets attention drift away from the desired area, and (2) modeling the target UI element uniformly fails to distinguish its center from its edges, leading to imprecise clicks. Inspired by how humans visually process and interact with GUI elements, we propose the Valley-to-Peak (V2P) method to address these issues. To mitigate background distractions, V2P introduces a suppression attention mechanism that minimizes the model's focus on irrelevant regions to highlight the intended region. For the issue of center-edge distinction, V2P applies a…
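The two ideas in the abstract can be illustrated with a small NumPy sketch: a Gaussian soft label that peaks at the element center, and a penalty on attention mass that falls on background patches. This is not the authors' implementation. The patch size, `sigma_scale`, the weight `lam`, the size-proportional spread, and all function names are assumptions made for illustration only.

```python
import numpy as np


def patch_centers(grid_hw, patch):
    """Pixel coordinates of patch centers along each axis."""
    rows, cols = grid_hw
    ys = (np.arange(rows) + 0.5) * patch
    xs = (np.arange(cols) + 0.5) * patch
    return ys, xs


def gaussian_patch_labels(box, grid_hw, patch=28, sigma_scale=0.25):
    """Soft target over the patch grid: highest at the box center (the peak)."""
    x0, y0, x1, y1 = box
    ys, xs = patch_centers(grid_hw, patch)
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Assumed: spread grows with element size, floored at half a patch
    # so that tiny elements still cover at least one patch.
    sx = max((x1 - x0) * sigma_scale, patch / 2.0)
    sy = max((y1 - y0) * sigma_scale, patch / 2.0)
    gx = np.exp(-0.5 * ((xs - cx) / sx) ** 2)
    gy = np.exp(-0.5 * ((ys - cy) / sy) ** 2)
    heat = np.outer(gy, gx)
    return heat / heat.sum()


def background_mask(box, grid_hw, patch=28):
    """True for patches whose center lies outside the target box."""
    x0, y0, x1, y1 = box
    ys, xs = patch_centers(grid_hw, patch)
    inside = ((ys >= y0) & (ys <= y1))[:, None] & ((xs >= x0) & (xs <= x1))[None, :]
    return ~inside


def peak_and_valley_loss(attn, box, grid_hw, patch=28, lam=1.0, eps=1e-8):
    """Cross-entropy to the Gaussian target plus a background-mass penalty."""
    attn = attn / attn.sum()
    target = gaussian_patch_labels(box, grid_hw, patch)
    peak_term = -(target * np.log(attn + eps)).sum()
    valley_term = attn[background_mask(box, grid_hw, patch)].sum()
    return peak_term + lam * valley_term


grid = (8, 8)
box = (56, 56, 140, 140)  # hypothetical target element, in pixels
target = gaussian_patch_labels(box, grid)
uniform = np.full(grid, 1.0 / (grid[0] * grid[1]))
loss_focused = peak_and_valley_loss(target, box, grid)
loss_uniform = peak_and_valley_loss(uniform, box, grid)
```

In this toy setup, attention that matches the Gaussian target gives a lower loss than uniform attention, because it both follows the center-peaked label and places little mass on background patches.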
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
The author's motivation is sound. When using models that use attention mechanisms as inductive biases, the two problems highlighted by the authors can be expected, and the authors have come up with an intuitive and innovative method to solve the problem. The authors conduct a detailed analysis/ablation of where their method excels, like what size, shape of the UI elements the methods provide the most gains, and how different design choices influence their method's performance. These help underst
1. Lack of comparable baselines: The paper is mainly a method paper, and hence, I believe a fair and controlled comparison across different methods is needed to justify the claims that the authors make. To justify the claims, two key baselines are necessary: - SFT-Only Baseline: The base model trained with conventional SFT on the authors' new, filtered dataset. This would establish a proper baseline and clarify how much improvement the proposed method contributes beyond a standard approach on t
- Using a Gaussian distribution to model GUI grounding clicks is quite reasonable and aligns well with human interaction patterns. The inverse-attention penalty also shows a certain degree of innovation.
- The writing is clear and easy to follow, and the figures are visually well-designed and polished.
- The explanation for why there is no improvement at all on ScreenSpot-v2 is not convincing. Figure 3(b) shows that the proposed method improves localization for small and medium elements, and ScreenSpot-v2 contains many such elements. The authors should further analyze why the method does not show improvement on ScreenSpot-v2.
- The proposed method is built upon GUI-Actor, but it is unclear why the V2P-baseline performs almost the same as GUI-Actor. The authors should provide results for the
- The idea is interesting and intuitive. Based on the qualitative analysis, the localization results appear accurate.
- The improvements on GUI grounding benchmarks are impressive.
- The draft is well-organized, and the authors provide both successful and failure cases of their method in the Appendix.
- Experiments are conducted only on GUI grounding benchmarks. It remains unclear whether the proposed method also performs well on GUI agent task benchmarks. Evaluating the approach on such tasks would be important, as user instructions in agent scenarios often do not exactly match the textual labels of GUI elements. It would also help clarify how the attention mechanism behaves when dealing with semantically ambiguous or partially mismatched instructions.
- Furthermore, after reviewing Appendi
S1: The proposed V2P method is precisely specified, with both the suppression loss over non-target patches and the Fitts-Gaussian label construction given.
S2: On the two evaluated benchmarks, V2P-7B achieves 92.3% on ScreenSpot-v2 and 50.5% on ScreenSpot-Pro, outperforming several GUI baselines.
S3: The idea of introducing Background Distraction and Centre-edge Confusion to the GUI grounding task is quite interesting.
W1: The empirical evaluation is restricted to two benchmarks, ScreenSpot-v2 and ScreenSpot-Pro, with no additional benchmarks reported in the main results. In particular, ScreenSpot-v2 is saturated, with baselines already achieving >90%. Consequently, the authors might need to include evaluations on diverse benchmarks, such as UI-Vision[1] and OSWorld-G[2], to support the paper's claims about its applicability to future GUI agents.
W2: All experiments are conducted on a single VLM backbone, Qwen2.5
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Gaze Tracking and Assistive Technology
