MVP: Multiple View Prediction Improves GUI Grounding
Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

TL;DR
This paper introduces Multi-View Prediction (MVP), a training-free framework that improves GUI grounding accuracy by aggregating multiple cropped views, significantly enhancing stability and performance across various models and benchmarks.
Contribution
MVP is a novel, training-free approach that leverages multi-view inference with attention-guided view proposals and clustering to stabilize coordinate predictions in GUI grounding.
Findings
MVP improves accuracy across multiple models and benchmarks.
MVP significantly boosts UI-TARS-1.5-7B to 56.1%.
MVP enhances stability against visual perturbations.
Abstract
GUI grounding, which translates natural language instructions into precise pixel coordinates, is essential for developing practical GUI agents. However, we observe that existing grounding models exhibit significant coordinate prediction instability: minor visual perturbations (e.g., cropping a few pixels) can drastically alter predictions, flipping results between correct and incorrect. This instability severely undermines model performance, especially for high-resolution samples with small UI elements. To address this issue, we propose Multi-View Prediction (MVP), a training-free framework that enhances grounding performance through multi-view inference. Our key insight is that while single-view predictions may be unstable, aggregating predictions from multiple carefully cropped views can effectively distinguish correct coordinates from outliers. MVP comprises two components: (1)…
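To make the aggregation idea concrete, here is a minimal sketch of how predictions from several cropped views might be mapped back to full-screenshot coordinates and combined so that an outlier is discarded. It is not the authors' implementation: the abstract does not specify the clustering procedure, so the greedy radius-based grouping, the `radius` value, the function names, and the sample coordinates are all assumptions made for illustration.

```python
from math import hypot

def to_full_image(pred, crop_origin):
    """Map a point predicted inside a crop back to full-screenshot coordinates."""
    return (pred[0] + crop_origin[0], pred[1] + crop_origin[1])

def aggregate_predictions(points, radius=30.0):
    """Greedy clustering (an assumed stand-in for the paper's method):
    a point joins the first cluster whose running centroid lies within
    `radius` pixels; the centroid of the largest cluster is returned."""
    if not points:
        raise ValueError("at least one prediction is required")
    clusters = []  # each cluster is a list of (x, y) points
    for p in points:
        for c in clusters:
            cx = sum(q[0] for q in c) / len(c)
            cy = sum(q[1] for q in c) / len(c)
            if hypot(p[0] - cx, p[1] - cy) <= radius:
                c.append(p)
                break
        else:
            # No existing cluster is close enough, so start a new one.
            clusters.append([p])
    best = max(clusters, key=len)
    return (sum(q[0] for q in best) / len(best),
            sum(q[1] for q in best) / len(best))

# Hypothetical per-view outputs: (point inside the crop, crop's top-left offset).
views = [
    ((110, 52), (400, 300)),
    ((12, 48), (500, 300)),
    ((208, 150), (300, 200)),
    ((40, 40), (0, 0)),  # an unstable view that lands far from the others
]
points = [to_full_image(p, o) for p, o in views]
final = aggregate_predictions(points)
```

In this toy input, three views agree on roughly the same location and one does not, so the largest cluster determines the final coordinate and the stray prediction has no influence on it.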
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Advanced Neural Network Applications
