UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

Liangyu Chen; Hanzhang Zhou; Chenglin Cai; Jianan Zhang; Panrong Tong; Quyu Kong; Xu Zhang; Chen Liu; Yuqi Liu; Wenxuan Wang; Yue Wang; Qin Jin; Steven Hoi

arXiv:2510.20286 · cs.CV · October 24, 2025

TL;DR

This paper introduces UI-Ins, a multi-perspective instruction reasoning approach for GUI grounding, significantly improving accuracy and robustness by treating instructions as dynamic analytical pathways and optimizing their selection during inference.

Contribution

It proposes the Instruction-as-Reasoning paradigm with a two-stage training framework, achieving state-of-the-art results and demonstrating emergent reasoning capabilities in GUI grounding models.

Findings

01. UI-Ins models achieve top accuracy on five benchmarks.

02. Exploiting instruction diversity at inference time yields up to a 76% relative performance improvement.

03. Models show strong agentic potential in real-world tasks.
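The "up to 76%" figure in finding 02 is a relative improvement, not an absolute accuracy gain. The sketch below illustrates one plausible reading of how such a number could be computed: each sample is grounded under several instruction perspectives, and the best-case accuracy (any perspective hits the target) is compared against a single-perspective baseline. The perspective names (`appearance`, `function`), the helper functions, and the toy data are all hypothetical and are not taken from the paper; the paper's actual evaluation protocol may differ.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def hits(point: Point, box: Box) -> bool:
    """A prediction counts as correct if the click point lies inside the target box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


def accuracy_single(samples: List[dict], perspective: str) -> float:
    """Accuracy when every sample uses the same, fixed instruction perspective."""
    correct = sum(hits(s["preds"][perspective], s["box"]) for s in samples)
    return correct / len(samples)


def accuracy_best_of(samples: List[dict]) -> float:
    """Best-case accuracy: a sample is correct if ANY perspective hits the target."""
    correct = sum(
        any(hits(p, s["box"]) for p in s["preds"].values()) for s in samples
    )
    return correct / len(samples)


def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline`, e.g. 0.5 means +50%."""
    return (improved - baseline) / baseline


# Toy data (illustrative only, not results from the paper).
TARGET: Box = (0.0, 0.0, 10.0, 10.0)
HIT: Point = (5.0, 5.0)
MISS: Point = (20.0, 20.0)
SAMPLES: List[Dict] = [
    {"box": TARGET, "preds": {"appearance": HIT, "function": MISS}},
    {"box": TARGET, "preds": {"appearance": MISS, "function": HIT}},
    {"box": TARGET, "preds": {"appearance": HIT, "function": HIT}},
    {"box": TARGET, "preds": {"appearance": MISS, "function": MISS}},
]

baseline = accuracy_single(SAMPLES, "appearance")  # 2 of 4 samples correct
best_case = accuracy_best_of(SAMPLES)              # 3 of 4 samples correct
gain = relative_improvement(baseline, best_case)
print(f"baseline={baseline:.2f} best_case={best_case:.2f} relative_gain={gain:.0%}")
```

On this toy data the baseline accuracy is 0.50 and the best-case accuracy is 0.75, a 25-point absolute gain that corresponds to a 50% relative improvement. The same distinction applies when reading the 76% figure.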

Abstract

GUI grounding, which maps natural-language instructions to actionable UI elements, is a core capability of GUI agents. Prior work largely treats instructions as a static proxy for user intent, overlooking the impact of instruction diversity and quality on grounding performance. Through a careful investigation of existing grounding datasets, we find a 23.3% flaw rate in their instructions and show that inference-time exploitation of instruction diversity yields up to a substantial 76% relative performance improvement. In this paper, we introduce the Instruction-as-Reasoning paradigm, treating instructions as dynamic analytical pathways that offer distinct perspectives and enabling the model to select the most effective pathway during reasoning. To achieve this, we propose a two-stage training framework: supervised fine-tuning (SFT) on synthesized, diverse instructions to instill…

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Neural Network Applications