JARVIS: A Just-in-Time Augmented Reality VLM-Powered Instruction System for Cross-Reality Task Guidance

Yusi Sun; Ying Jiang; Jiayin Lu; Yin Yang; Yong-Hong Kuo; Chenfanfu Jiang

arXiv:2604.10108 · cs.HC · May 19, 2026

TL;DR

JARVIS is an AR instruction system powered by vision-language models that provides real-time, context-aware guidance for hybrid physical-virtual tasks, improving user performance and experience.

Contribution

The paper introduces JARVIS, a novel VLM-driven AR system that generates adaptive, step-by-step guidance for cross-reality tasks, addressing limitations of prior systems.

Findings

01. JARVIS enhances usability and reduces workload in cross-reality tasks.

02. The system achieves higher success rates and better visualization effectiveness.

03. A formative study categorizes guidance needs into four cross-reality types.

Abstract

Many everyday tasks rely on external tutorials such as manuals and videos, requiring users to constantly switch between reading instructions and performing actions, which disrupts workflow and increases cognitive load. Augmented reality (AR) enables in-situ guidance, while recent advances in large language models (LLMs) and vision-language models (VLMs) make it possible to automatically generate such guidance. However, existing AI-powered AR tutorial systems primarily focus on physical procedural tasks and provide limited support for hybrid physical and virtual workspaces. To address this gap, we conduct a formative study of cross-reality tasks and identify key requirements for state awareness and cross-reality coordination. We present JARVIS, a VLM-driven AR instruction system that generates contextual, step-by-step guidance from a single prompt, with real-time state verification and…
