TL;DR
AIvaluateXR is a framework for benchmarking large language models (LLMs) that run directly on XR devices. It measures performance consistency, processing speed, memory usage, and battery consumption to guide the choice of model and device.
Contribution
The paper presents an evaluation framework for assessing LLMs on XR hardware, together with benchmark results for 17 models on four devices (68 model-device pairs).
Findings
Performance varies significantly across device-model pairs.
The framework identifies optimal trade-offs between quality and speed.
On-device LLMs show competitive efficiency compared to cloud-based setups.
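One common way to make a quality-versus-speed trade-off concrete is a Pareto front: the set of model-device pairs that no other pair beats on both axes at once. The sketch below is illustrative only. It is not taken from the paper, and the `Result` fields, the labels, and all numbers are hypothetical placeholders rather than reported measurements.

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    model: str      # hypothetical model label
    device: str     # hypothetical device label
    quality: float  # assumed quality score, higher is better
    speed: float    # assumed throughput in tokens/s, higher is better


def pareto_front(results: List[Result]) -> List[Result]:
    """Return the results that are not dominated on (quality, speed)."""
    front = []
    for r in results:
        # r is dominated if some other result is at least as good on both
        # axes and strictly better on at least one of them.
        dominated = any(
            o.quality >= r.quality
            and o.speed >= r.speed
            and (o.quality > r.quality or o.speed > r.speed)
            for o in results
        )
        if not dominated:
            front.append(r)
    return front


# Placeholder data, not results from the paper.
results = [
    Result("model-a", "device-1", 0.62, 18.0),
    Result("model-b", "device-1", 0.70, 9.5),
    Result("model-a", "device-2", 0.62, 12.0),
    Result("model-c", "device-2", 0.55, 8.0),
]

front = pareto_front(results)
for r in front:
    print(r.model, r.device, r.quality, r.speed)
```

Pairs on the front are the candidates worth comparing by hand; which one is "optimal" still depends on how much latency a given XR application can tolerate.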
Abstract
The deployment of large language models (LLMs) on extended reality (XR) devices has great potential to advance the field of human-AI interaction. In the case of direct, on-device model inference, selecting the appropriate model and device for specific tasks remains challenging. In this paper, we present AIvaluateXR, a comprehensive evaluation framework for benchmarking LLMs running on XR devices. To demonstrate the framework, we deploy 17 selected LLMs across four XR platforms: Magic Leap 2, Meta Quest 3, Vivo X100s Pro, and Apple Vision Pro, and conduct an extensive evaluation. Our experimental setup measures four key metrics: performance consistency, processing speed, memory usage, and battery consumption. For each of the 68 model-device pairs, we assess performance under varying string lengths, batch sizes, and thread counts, analyzing the trade-offs for real-time XR applications. We…
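The size of the evaluation grid described in the abstract can be sketched as follows: 17 models on 4 devices gives 17 × 4 = 68 pairs, and each pair is then run under several string lengths, batch sizes, and thread counts. The device names come from the abstract. The model names, the sweep values, and the assumption that every combination is run (a full factorial sweep) are hypothetical and only serve to illustrate the structure.

```python
from itertools import product

# Devices named in the abstract.
devices = ["Magic Leap 2", "Meta Quest 3", "Vivo X100s Pro", "Apple Vision Pro"]

# Placeholder model names; the abstract gives only the count (17).
models = [f"model_{i:02d}" for i in range(1, 18)]

# Every model on every device: 17 * 4 = 68 model-device pairs.
pairs = list(product(models, devices))

# Hypothetical sweep values; the abstract does not list the actual settings.
string_lengths = [64, 256, 1024]
batch_sizes = [128, 512]
thread_counts = [1, 4, 8]

# One run configuration per pair and parameter combination
# (assumes a full factorial sweep, which the abstract does not state).
runs = [
    {
        "model": model,
        "device": device,
        "string_length": length,
        "batch_size": batch,
        "threads": threads,
    }
    for (model, device), length, batch, threads in product(
        pairs, string_lengths, batch_sizes, thread_counts
    )
]

print(len(pairs), "pairs,", len(runs), "run configurations")
```

Under these assumed sweep values each pair gets 3 × 2 × 3 = 18 configurations, so the run count grows quickly with every extra setting.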
