See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations
Guangyan Chen, Meiling Wang, Qi Shao, Zichen Zhou, Weixin Mao, Te Cui, Minzhao Zhu, Yinan Deng, Luojie Yang, Zhanqi Zhang, Yi Yang, Hua Chen, and Yufeng Yue

TL;DR
ViVLA is a robotic manipulation model that learns new tasks from a single demonstration video by jointly processing expert videos and robot observations, significantly improving generalization to unseen tasks.
Contribution
The paper introduces ViVLA, a novel one-shot learning approach for robot manipulation that leverages a large-scale synthetic dataset and joint video processing for improved task generalization.
Findings
Achieves over 30% improvement on unseen LIBERO tasks.
Maintains gains above 35% when the demonstration videos come from a different embodiment.
Demonstrates effective real-world learning from human videos.
Abstract
Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring novel skills by simply observing others performing them once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, effectively distilling fine-grained manipulation knowledge from expert…
Taxonomy
Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI
