ClickAIXR: On-Device Multimodal Vision-Language Interaction with Real-World Objects in Extended Reality

Dawar Khan; Alexandre Kouyoumdjian; Xinyu Liu; Omar Mena; Dominik Engel; Ivan Viola

arXiv:2604.04905·cs.CV·April 7, 2026

TL;DR

ClickAIXR is an on-device multimodal XR interaction framework that enables precise object selection and natural language querying, enhancing privacy and reducing latency compared to cloud-based systems.

Contribution

The paper introduces an on-device vision-language interaction system for XR in which users select objects by clicking with a controller, with the aim of improving privacy, transparency, and user experience.

Findings

01. User study shows acceptable latency and usability.

02. On-device inference enhances privacy and reduces reliance on cloud services.

03. System effectively combines click-based selection with natural language understanding.

Abstract

We present ClickAIXR, a novel on-device framework for multimodal vision-language interaction with objects in extended reality (XR). Unlike prior systems that rely on cloud-based AI (e.g., ChatGPT) or gaze-based selection (e.g., GazePointAR), ClickAIXR integrates an on-device vision-language model (VLM) with a controller-based object selection paradigm, enabling users to precisely click on real-world objects in XR. Once selected, the object image is processed locally by the VLM to answer natural language questions through both text and speech. This object-centered interaction reduces ambiguity inherent in gaze- or voice-only interfaces and improves transparency by performing all inference on-device, addressing concerns around privacy and latency. We implemented ClickAIXR in the Magic Leap SDK (C API) with ONNX-based local VLM inference. We conducted a user study comparing ClickAIXR with…
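The click-then-query flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which uses the Magic Leap SDK (C API) and ONNX-based inference. Everything named here is a hypothetical stand-in: the frame is a plain list of pixel rows, `crop_around_click` uses a fixed square window where the real system may isolate the object differently, and `vlm` is any local callable that takes an image patch and a question.

```python
from typing import Callable, List, Tuple

# Hypothetical frame type: rows of grayscale pixel values.
Frame = List[List[int]]


def crop_around_click(frame: Frame, click_xy: Tuple[int, int], half_size: int) -> Frame:
    """Return the square patch centred on the click, clamped to the frame bounds."""
    height = len(frame)
    width = len(frame[0]) if height else 0
    x, y = click_xy
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("click lies outside the frame")
    left = max(0, x - half_size)
    top = max(0, y - half_size)
    right = min(width, x + half_size + 1)    # exclusive bound
    bottom = min(height, y + half_size + 1)  # exclusive bound
    return [row[left:right] for row in frame[top:bottom]]


def answer_about_click(
    frame: Frame,
    click_xy: Tuple[int, int],
    question: str,
    vlm: Callable[[Frame, str], str],
    half_size: int = 64,
) -> str:
    """Crop the clicked region and pass it, with the question, to a local model."""
    patch = crop_around_click(frame, click_xy, half_size)
    # Assumption: `vlm` runs on the device, so the patch never leaves it.
    return vlm(patch, question)
```

Because only the cropped patch reaches the model, the question is tied to the region the user clicked rather than to the whole scene, which is the ambiguity reduction the abstract attributes to object-centered selection.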
