Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps

Shuqing Li; Binchang Li; Yepang Liu; Cuiyun Gao; Jianping Zhang; Shing-Chi Cheung; Michael R. Lyu

arXiv:2409.10811 · cs.SE · October 2, 2025

Open Access

TL;DR

This paper introduces Orienter, a novel zero-shot framework for detecting interactable GUI elements in XR apps by understanding semantic context and iteratively refining detection, addressing challenges of heterogeneity and open-vocabulary categories.

Contribution

The paper presents the first zero-shot, context-sensitive IGE detection framework tailored for XR apps, improving detection accuracy over existing methods.

Findings

01. Orienter outperforms state-of-the-art detection approaches.

02. It effectively handles open-vocabulary and heterogeneous IGE categories.

03. The framework demonstrates robustness in complex XR environments.

Abstract

In recent years, spatial computing, a.k.a. Extended Reality (XR), has emerged as a transformative technology, offering users immersive and interactive experiences across diverse virtual environments. Users interact with XR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). Accurate recognition of these IGEs is instrumental, serving as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model on a large-scale, manually labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories such as buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps, due to a multitude of challenges including…

Taxonomy

Topics: Augmented Reality Applications · Robotics and Automated Systems · Context-Aware Activity Recognition Systems