Grounded GUI Understanding for Vision-Based Spatial Intelligent Agent: Exemplified by Extended Reality Apps
Shuqing Li, Binchang Li, Yepang Liu, Cuiyun Gao, Jianping Zhang, Shing-Chi Cheung, Michael R. Lyu

TL;DR
This paper introduces Orienter, a novel zero-shot framework for detecting interactable GUI elements (IGEs) in XR apps. It understands the semantic context of a scene and iteratively refines its detections, addressing the challenges of heterogeneity and open-vocabulary categories.
Contribution
The paper presents the first zero-shot, context-sensitive IGE detection framework tailored for XR apps, improving detection accuracy over existing methods.
Findings
Orienter outperforms state-of-the-art detection approaches.
It effectively handles open-vocabulary and heterogeneous IGE categories.
The framework demonstrates robustness in complex XR environments.
Abstract
In recent years, spatial computing, a.k.a. Extended Reality (XR), has emerged as a transformative technology, offering users immersive and interactive experiences across diverse virtual environments. Users interact with XR apps through interactable GUI elements (IGEs) on the stereoscopic three-dimensional (3D) graphical user interface (GUI). Accurate recognition of these IGEs is essential, as it serves as the foundation of many software engineering tasks, including automated testing and effective GUI search. The most recent IGE detection approaches for 2D mobile apps typically train a supervised object detection model on a large-scale, manually labeled GUI dataset, usually with a pre-defined set of clickable GUI element categories such as buttons and spinners. Such approaches can hardly be applied to IGE detection in XR apps, due to a multitude of challenges including…
Taxonomy
Topics: Augmented Reality Applications · Robotics and Automated Systems · Context-Aware Activity Recognition Systems
