JARVIS: A Just-in-Time Augmented Reality VLM-Powered Instruction System for Cross-Reality Task Guidance
Yusi Sun, Ying Jiang, Jiayin Lu, Yin Yang, Yong-Hong Kuo, Chenfanfu Jiang

TL;DR
JARVIS is an AR instruction system powered by vision-language models that provides real-time, context-aware guidance for hybrid physical-virtual tasks, improving user performance and experience.
Contribution
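The paper introduces JARVIS, a VLM-driven AR system that generates adaptive, step-by-step guidance for cross-reality tasks. It targets a gap in prior AI-powered AR tutorial systems, which focus primarily on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces.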
Findings
JARVIS enhances usability and reduces workload in cross-reality tasks.
The system achieves higher success rates and better visualization effectiveness.
A formative study categorizes guidance needs into four cross-reality types.
Abstract
Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and…
