HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

Shaojie Zhang; Pei Fu; Ruoceng Zhang; Jiahui Yang; Anan Du; Xiuwen Xi; Shaokang Wang; Ying Huang; Bin Qin; Zhenbo Luo; Jian Luan

arXiv:2510.27266·cs.CV·November 3, 2025



TL;DR

HyperClick improves GUI grounding reliability by calibrating model confidence, combining accuracy with spatial confidence modeling, and enabling self-criticism, leading to state-of-the-art results in GUI automation tasks.

Contribution

The paper introduces HyperClick, a novel framework that calibrates confidence in GUI grounding models, enhancing reliability and self-awareness in GUI automation.

Findings

01. Achieves state-of-the-art performance on seven benchmarks.

02. Provides well-calibrated confidence estimates.

03. Reduces overconfidence in GUI grounding models.

Abstract

Autonomous Graphical User Interface (GUI) agents rely on accurate GUI grounding, which maps language instructions to on-screen coordinates, to execute user commands. However, current models, whether trained via supervised fine-tuning (SFT) or reinforcement fine-tuning (RFT), lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. We first systematically evaluate probabilistic and verbalized confidence in general and GUI-specific models, revealing a misalignment between confidence and actual accuracy, which is particularly critical in dynamic GUI automation tasks, where single errors can cause task failure. To address this, we propose HyperClick, a novel framework that enhances reliable GUI grounding through uncertainty calibration. HyperClick introduces a dual reward mechanism, combining a binary reward for correct actions with a…
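The misalignment between confidence and accuracy described in the abstract can be quantified in several ways. The sketch below uses expected calibration error (ECE), a standard metric; the abstract does not say which metric the authors use, so treat this as an illustration only. The function name, the equal-width binning, and the `n_bins` default are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: floats in [0, 1], one per prediction.
    correct: booleans (or 0/1), True when the predicted click was a hit.
    n_bins: number of equal-width confidence bins (an assumed default).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width confidence bins.
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # c == 1.0 goes in the last bin
        bins[idx].append((c, 1.0 if ok else 0.0))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(mean_conf - accuracy)
    return ece


# An overconfident model: always reports 0.9 but is right one time in four.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
```

A value near 0 means reported confidence tracks accuracy; the overconfident toy example above yields a large gap.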

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 2 · Confidence 4

Strengths

Significance: At line 370, Table 1 shows SOTA performance of the 7B model on 6 of 7 benchmarks, and second-best performance on the remaining benchmark. Unfortunately, it is not clear to me that this is because of the paper's contribution. It might just be the strength of the underlying model (see questions in later section). Table 2 and Figure 3 suggest that the model's predicted confidences are informative and could be useful downstream.

Originality: It's a nice idea to use some kind of unc

Weaknesses

The method feels ad hoc. Why not start with ordinary uncertainty quantification methods? The paper's terminology and writing are often imprecise. I don't think c is really a "confidence" in the usual sense. I don't think a low L2 error in predicting c is really "calibration" in the usual sense. (Calibration would involve mapping the predicted c values to actual error rates on the primary task, e.g., using isotonic regression or Platt scaling.) Nor should that L2 error be called a Brier sc
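For reference, the standard Brier score that the reviewer alludes to compares a predicted probability against a binary outcome, not against a soft target. A minimal sketch follows; the function name is illustrative and nothing here is taken from the paper.

```python
def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome.

    probs: predicted probabilities of success, each in [0, 1].
    outcomes: observed binary results (1 or True for a correct prediction).
    """
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have the same length")
    if not probs:
        raise ValueError("need at least one prediction")
    return sum((p - float(y)) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


# Hypothetical data: confidence 0.9 on four clicks, only the first one correct.
score = brier_score([0.9, 0.9, 0.9, 0.9], [1, 0, 0, 0])
```

Lower is better. Recalibration methods such as Platt scaling or isotonic regression, which the reviewer mentions, would be fitted on pairs like `(probs, outcomes)` from held-out data.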

Reviewer 02 · Rating 4 · Confidence 3

Strengths

- Well-Posed and Important Problem. GUI grounding is a key task, and overconfidence is a key obstacle.
- Extensive and sound experiments. Many ablations on the algorithm and models are conducted to verify the design of HyperClick.

Weaknesses

- Unclear contribution of modules. Although Tables 3 and 4 list some ablations of the algorithm, they show very small differences, especially in the reward configurations. The key contribution of this work, the confidence reward, only brings a marginal 0.5% improvement.
- The confidence assumption: HyperClick assumes the model should be most confident when predicting the center of the bounding box, with confidence decaying as a Gaussian function of distance from it. Some data labels (especially human-annotated ones) may not be a
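The confidence assumption the reviewer describes, a peak at the box center with Gaussian decay, can be sketched as below. The exact form used by HyperClick is not given in this excerpt: the axis-aligned Gaussian, the function name, and the `sigma_scale` parameter (standard deviation as a fraction of box width and height) are all assumptions made for illustration.

```python
import math


def gaussian_confidence(x, y, bbox, sigma_scale=0.5):
    """Target confidence for a click at (x, y) given a ground-truth box.

    bbox: (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    sigma_scale: assumed hyperparameter; sigma = sigma_scale * box size per axis.
    Returns 1.0 at the box center and decays toward 0 away from it.
    """
    x1, y1, x2, y2 = bbox
    sx = sigma_scale * (x2 - x1)
    sy = sigma_scale * (y2 - y1)
    if sx <= 0 or sy <= 0:
        raise ValueError("bbox must have positive width and height")
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    dx = (x - cx) / sx
    dy = (y - cy) / sy
    return math.exp(-0.5 * (dx * dx + dy * dy))


# Hypothetical 100x40 button: a center click scores highest, a corner click lower.
button = (0.0, 0.0, 100.0, 40.0)
center_conf = gaussian_confidence(50.0, 20.0, button)
corner_conf = gaussian_confidence(0.0, 0.0, button)
```

Under this form, a label whose box is not centered on the true target would shift the peak, which is the concern the reviewer raises about human-annotated data.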

Reviewer 03 · Rating 2 · Confidence 2

Strengths

- The motivation for this work is strong: modeling uncertainty in GUI grounding is an important problem, especially as GUI agents are given more access.

Weaknesses

- The motivation for additional datasets (MMG, I2E, CAG, UIV) is not clear, especially given that the authors have only reported results for 2/36 RFT model-dataset pairs (which seem to outperform SFT models on average).
- It is unclear whether the proposed approach outperforms GUI-G2. GUI-G2 is only evaluated on three of the seven datasets and their performance is within a percent on these three datasets.
- Additionally, one of the main benefits of the confidence loss seems to be the improved ca

Reviewer 04 · Rating 6 · Confidence 3

Strengths

1. GUI grounding is the fundamental GUI adaptation for GUI agents, enabling them to identify GUI elements for a specific user command. However, current models lack self-awareness of their capability boundaries, leading to overconfidence and unreliable predictions. Enhancing the reliability of GUI grounding is critical for the robustness of GUI agents.
2. The proposed HyperClick introduces correctness reward and confidence reward, which are designed to jointly optimize grounding accuracy and conf

Weaknesses

1. The utilization of a Gaussian representation was previously introduced in GUI-G2, which makes parts of this methodology appear similar. A more explicit differentiation from GUI-G2 is needed to better clarify the novelty of HyperClick.
2. The ablation in Table 3 omits the vanilla baseline where only $R_{format}$ is adopted, which may provide a better understanding of the impact of the proposed methods.
3. The 3B models surprisingly outperform the 7B models in Table 2, which is counterintuitive. F

Reviewer 05 · Rating 4 · Confidence 4

Strengths

1. The main result benchmarks are sufficient, even including CAGUI and MMBenchGUI which are not so widely used.
2. The performance achieves SOTA among several open-source models reported.
3. I personally like the starting point of this paper: using Probabilistic Confidence and Verbalized Confidence as preliminary studies.

Weaknesses

1. About **Novelty**. The biggest weakness lies in novelty. This paper incorporates a Gaussian-based term in GRPO rewards to reduce overconfidence. However, the **confidence** and **Gaussian-based reward** are not so timely now. For **confidence**, Visual-RFT [1] aims at reducing overconfidence by introducing $R_{conf}$. The ideas are similar to HyperClick: for successfully matched boxes, the higher the confidence, the better. For **Gaussian-based reward**, GUI-G$^2$ [2] first proposes Gaussian


Taxonomy

Topics: Speech and dialogue systems · Multimodal Machine Learning Applications · Topic Modeling