VISTA: A Generative Egocentric Video Framework for Daily Assistance
Yu-Hsiang Liu, Yu-Chien Tang, An-Zi Yen

TL;DR
VISTA is a high-fidelity egocentric video synthesis system designed to generate diverse training data for AI agents assisting with daily activities, addressing data scarcity and safety concerns.
Contribution
VISTA introduces a 5-step script generation pipeline with causal reverse reasoning that produces customizable, realistic egocentric videos for both reactive and proactive assistance scenarios.
Findings
Generates diverse, logically grounded egocentric videos for training and evaluating AI agents.
Supports proactive and reactive assistance modes with user customization.
Provides a scalable alternative to real-world data collection.
Abstract
Training AI agents to proactively assist humans in daily activities, from routine household tasks to urgent safety situations, requires large-scale visual data. However, capturing such scenarios in the real world is often difficult, costly, or unsafe, and physics-based simulators lack the visual fidelity needed to transfer learned behaviors to real settings. Therefore, we introduce VISTA, a video synthesis system that produces high-fidelity egocentric videos as training and evaluation data for AI agents. VISTA employs a 5-step script generation pipeline with causal reverse reasoning to create diverse, logically grounded intervention modes. These scenarios span two levels of agent autonomy: reactive and proactive. In reactive modes, the user explicitly asks the agent for help. In proactive modes, the agent offers help without receiving a direct request. We further divide proactive modes…
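The reactive/proactive distinction described in the abstract can be illustrated with a minimal sketch. It assumes a hypothetical `Scenario` record with a `user_requested_help` flag; these names and the two example scenarios are illustrative assumptions, not part of VISTA, and the sketch does not cover the further subdivision of proactive modes, which is cut off in the abstract above.

```python
from dataclasses import dataclass
from enum import Enum


class Autonomy(Enum):
    REACTIVE = "reactive"    # the user explicitly asks the agent for help
    PROACTIVE = "proactive"  # the agent offers help without a direct request


@dataclass
class Scenario:
    # Hypothetical record; field names are illustrative, not from the paper.
    description: str
    user_requested_help: bool


def autonomy_level(scenario: Scenario) -> Autonomy:
    """Label a scenario by whether the user made an explicit request."""
    if scenario.user_requested_help:
        return Autonomy.REACTIVE
    return Autonomy.PROACTIVE


# Made-up example scenarios, used only to exercise the labeling rule.
examples = [
    Scenario("User asks where they left their keys", True),
    Scenario("A pot is left unattended on a lit stove", False),
]
labels = [autonomy_level(s).value for s in examples]
print(labels)
```

A single boolean is the simplest encoding of the two-level split stated in the abstract. A real script generator would likely need richer fields (for example, the triggering event and the timing of the intervention), which this summary does not specify.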
