TL;DR
This paper introduces zoom consistency, a geometric confidence signal in multi-step visual grounding pipelines, which correlates with prediction correctness and can improve model routing decisions.
Contribution
It demonstrates that zoom consistency is a useful, calibration-free confidence measure derived from intermediate outputs in multi-step visual grounding models.
Findings
Zoom consistency correlates with prediction correctness (AUC up to 0.60).
It can be used to route between models, improving accuracy by 0.8%.
Under idealized conditions (perfect step-2 prediction, target within the crop), the measure is a linear estimator of step-1 spatial error.
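As a rough illustration of the routing finding, the sketch below picks, per example, the model whose step-2 prediction lands closest to its crop center. This selection rule is an assumption made for illustration; the summary does not state the paper's actual routing policy. The `Candidate` type, its fields, and the `route` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """One model's output for a single grounding query (hypothetical container)."""
    model: str
    point: Tuple[float, float]  # final prediction in full-image pixels
    zoom_distance: float        # distance from step-2 prediction to crop center


def route(candidates: List[Candidate]) -> Candidate:
    """Return the candidate with the smallest zoom-consistency distance.

    Assumption: a smaller distance signals a more trustworthy prediction, and
    distances from different models are compared directly because they live in
    the same coordinate space.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return min(candidates, key=lambda c: c.zoom_distance)


candidates = [
    Candidate(model="model_a", point=(412.0, 301.0), zoom_distance=35.2),
    Candidate(model="model_b", point=(409.0, 298.0), zoom_distance=12.7),
]
chosen = route(candidates)
```

Here `chosen` is the `model_b` candidate, since its distance is smaller. A thresholded variant (fall back to a second model only when the first model's distance exceeds some cutoff) would be an equally plausible reading of "routing".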
Abstract
Multi-step zoom-in pipelines are widely used for GUI grounding, yet the intermediate predictions they produce are typically discarded after coordinate remapping. We observe that these intermediate outputs contain a useful confidence signal for free: zoom consistency, the distance between a model's step-2 prediction and the crop center. Unlike log-probabilities or token-level uncertainty, zoom consistency is a geometric quantity in a shared coordinate space, making it directly comparable across architecturally different VLMs without calibration. We prove this quantity is a linear estimator of step-1 spatial error under idealized conditions (perfect step-2, target within crop) and show it correlates with prediction correctness across two VLMs (AUC = 0.60; Spearman rho = -0.14, p < 10^{-6} for KV-Ground-8B; rho = -0.11, p = 0.0003 for Qwen3.5-27B). The correlation is small but consistent…
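A minimal sketch of the quantity described in the abstract, under stated assumptions: the crop is an axis-aligned box centered on the step-1 prediction, the step-2 prediction is given in normalized crop coordinates, and distances are measured in full-image pixels. The function names and the coordinate conventions are illustrative and not taken from the paper.

```python
import math
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in full-image pixels


def remap_to_image(step2_norm: Tuple[float, float], crop: Box) -> Tuple[float, float]:
    """Map a step-2 prediction from normalized crop coordinates to image pixels."""
    u, v = step2_norm
    x0, y0, x1, y1 = crop
    return (x0 + u * (x1 - x0), y0 + v * (y1 - y0))


def zoom_consistency(step2_norm: Tuple[float, float], crop: Box) -> float:
    """Euclidean distance between the remapped step-2 prediction and the crop center."""
    x0, y0, x1, y1 = crop
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    px, py = remap_to_image(step2_norm, crop)
    return math.hypot(px - cx, py - cy)


# Idealized case: step 2 is perfect and the target lies inside the crop.
target = (410.0, 300.0)          # ground-truth location (illustrative numbers)
step1 = (400.0, 280.0)           # step-1 prediction
crop = (300.0, 180.0, 500.0, 380.0)  # 200x200 box centered on step1 (assumption)

# A perfect step 2 returns the target's position inside the crop.
perfect_step2 = ((target[0] - crop[0]) / (crop[2] - crop[0]),
                 (target[1] - crop[1]) / (crop[3] - crop[1]))

distance = zoom_consistency(perfect_step2, crop)
step1_error = math.hypot(target[0] - step1[0], target[1] - step1[1])
```

In this idealized setup `distance` and `step1_error` coincide, which is one way to see why the quantity can act as an estimator of step-1 spatial error. If the distance were instead measured in normalized crop units, the two would differ by a constant factor set by the crop size, so the relationship would remain linear.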
