See Once, Then Act: Vision-Language-Action Model with Task Learning from One-Shot Video Demonstrations

Guangyan Chen; Meiling Wang; Qi Shao; Zichen Zhou; Weixin Mao; Te Cui; Minzhao Zhu; Yinan Deng; Luojie Yang; Zhanqi Zhang; Yi Yang; Hua Chen; and Yufeng Yue

arXiv:2512.07582 · cs.RO · December 9, 2025

TL;DR

ViVLA is a robotic manipulation model that learns new tasks from a single demonstration video by jointly processing expert videos and robot observations, significantly improving generalization to unseen tasks.

Contribution

The paper introduces ViVLA, a novel one-shot learning approach for robot manipulation that leverages a large-scale synthetic dataset and joint video processing for improved task generalization.

Findings

01. Achieves over 30% improvement on unseen LIBERO tasks.

02. Maintains above 35% gains with cross-embodiment videos.

03. Demonstrates effective real-world learning from human videos.

Abstract

Developing robust and general-purpose manipulation policies represents a fundamental objective in robotics research. While Vision-Language-Action (VLA) models have demonstrated promising capabilities for end-to-end robot control, existing approaches still exhibit limited generalization to tasks beyond their training distributions. In contrast, humans possess remarkable proficiency in acquiring novel skills by simply observing others performing them once. Inspired by this capability, we propose ViVLA, a generalist robotic manipulation policy that achieves efficient task learning from a single expert demonstration video at test time. Our approach jointly processes an expert demonstration video alongside the robot's visual observations to predict both the demonstrated action sequences and subsequent robot actions, effectively distilling fine-grained manipulation knowledge from expert…

Taxonomy

Topics: Multimodal Machine Learning Applications · Robot Manipulation and Learning · Social Robot Interaction and HRI