S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight
Haodong Yan, Zhide Zhong, Jiaguan Zhu, Junjie He, Weilin Yuan, Wenxuan Song, Xin Gong, Yingjie Cai, Guanyi Zhao, Xu Yan, Bingbing Liu, Ying-Cong Chen, Haoang Li

TL;DR
S-VAM is a shortcut video-action model that foresees coherent geometric and semantic representations in a single forward pass. A self-distillation strategy makes this one-step inference real-time while keeping the foresight high-fidelity, which in turn simplifies action prediction for robot manipulation.
Contribution
The paper proposes a novel shortcut video-action model with a self-distillation strategy that condenses multi-step generative priors into one-step inference, enabling efficient and accurate action prediction.
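The idea of condensing a multi-step process into one-step inference can be illustrated with a deliberately tiny toy. This is not the paper's architecture or training recipe: the linear "teacher", the least-squares "student", and every name below (`teacher_multi_step`, `distill_one_step`, `A`, `step_size`) are hypothetical and chosen only to make the concept concrete.

```python
import numpy as np


def teacher_multi_step(x, A, steps=8, step_size=0.3):
    """Toy multi-step 'teacher': iteratively refine z toward the target x @ A.

    Stands in for a slow iterative generator; it is NOT a diffusion model.
    """
    z = np.zeros((x.shape[0], A.shape[1]))
    for _ in range(steps):
        z = z + step_size * (x @ A - z)  # one small refinement step
    return z


def distill_one_step(x, teacher_out):
    """Fit a one-step linear 'student' W so that x @ W matches the teacher."""
    W, *_ = np.linalg.lstsq(x, teacher_out, rcond=None)
    return W


rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))            # hypothetical target mapping
x_train = rng.normal(size=(64, 4))     # hypothetical input features

# The teacher's multi-step outputs become regression targets for the student.
targets = teacher_multi_step(x_train, A)
W_student = distill_one_step(x_train, targets)

# On unseen inputs, a single matrix multiply reproduces the 8-step teacher.
x_new = rng.normal(size=(5, 4))
gap = np.max(np.abs(x_new @ W_student - teacher_multi_step(x_new, A)))
print(f"max |student - teacher| on new inputs: {gap:.2e}")
```

In this toy the student matches the teacher almost exactly because both are linear. A learned one-step network distilled from a real multi-step denoiser would only approximate it, and how S-VAM handles that gap is described in the paper itself rather than here.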
Findings
Outperforms state-of-the-art methods in simulation and real-world tasks.
Enables real-time inference with high-fidelity foresight.
Improves robotic manipulation in complex environments.
Abstract
Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion…
Taxonomy
Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition
