TL;DR
ClickAIXR is an on-device multimodal XR interaction framework that enables precise object selection and natural language querying, enhancing privacy and reducing latency compared to cloud-based systems.
Contribution
ClickAIXR introduces a novel on-device vision-language interaction system with click-based object selection in XR, aimed at improving privacy, transparency, and user experience.
Findings
A user study shows acceptable latency and usability.
On-device inference enhances privacy and reduces reliance on cloud services.
The system combines click-based selection with natural language understanding.
Abstract
We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with…
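The abstract says that a clicked object's image is passed to the local VLM, but it does not say how that image is extracted from the camera frame. As a minimal sketch, assuming a fixed-size square crop centered on the click point (an illustrative choice, not the authors' method), the selection step could look like the following. The names `crop_box` and `crop`, and the 224-pixel default, are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop_box(click_x: int, click_y: int, frame_w: int, frame_h: int,
             size: int = 224) -> Box:
    """Return a box of at most size x size pixels around the click.

    The box is centered on the click where possible and shifted so that
    it stays entirely inside the frame. The 224 default is an assumption.
    """
    w = min(size, frame_w)
    h = min(size, frame_h)
    left = min(max(click_x - w // 2, 0), frame_w - w)
    top = min(max(click_y - h // 2, 0), frame_h - h)
    return left, top, left + w, top + h


def crop(frame: Sequence[Sequence], box: Box) -> List[list]:
    """Cut the box out of a row-major frame (a list of pixel rows)."""
    left, top, right, bottom = box
    return [list(row[left:right]) for row in frame[top:bottom]]
```

In a pipeline like the one described, the cropped patch would then be handed to the on-device model together with the user's question. Clamping the box rather than padding it keeps the patch at a constant size even when the click lands near the edge of the frame.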
