S-VAM: Shortcut Video-Action Model by Self-Distilling Geometric and Semantic Foresight

Haodong Yan; Zhide Zhong; Jiaguan Zhu; Junjie He; Weilin Yuan; Wenxuan Song; Xin Gong; Yingjie Cai; Guanyi Zhao; Xu Yan; Bingbing Liu; Ying-Cong Chen; Haoang Li

arXiv:2603.16195 · cs.CV · March 19, 2026

TL;DR

S-VAM is a shortcut video-action model that uses self-distillation to foresee coherent geometric and semantic representations in a single forward pass, targeting real-time inference with high-fidelity foresight for robot manipulation.

Contribution

The paper proposes S-VAM, a shortcut video-action model, together with a self-distillation strategy that condenses the structured generative priors of multi-step denoising into one-step inference, so that action prediction can rely on foreseen representations without slow multi-step video generation.

Findings

01. Outperforms state-of-the-art methods in simulation and real-world tasks.

02. Enables real-time inference with high-fidelity foresight.

03. Improves robotic manipulation in complex environments.

Abstract

Video action models (VAMs) have emerged as a promising paradigm for robot learning, owing to their powerful visual foresight for complex manipulation tasks. However, current VAMs, typically relying on either slow multi-step video generation or noisy one-step feature extraction, cannot simultaneously guarantee real-time inference and high-fidelity foresight. To address this limitation, we propose S-VAM, a shortcut video-action model that foresees coherent geometric and semantic representations via a single forward pass. Serving as a stable blueprint, these foreseen representations significantly simplify the action prediction. To enable this efficient shortcut, we introduce a novel self-distillation strategy that condenses structured generative priors of multi-step denoising into one-step inference. Specifically, vision foundation model (VFM) representations extracted from the diffusion…
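The idea of condensing multi-step denoising into one-step inference can be illustrated with a toy sketch. This is not the S-VAM method: the functions `teacher_multistep`, `train_student`, and `student_onestep`, the scalar "representation", and the linear student are all hypothetical stand-ins chosen for illustration. The teacher here is a fixed iterative refinement loop rather than a diffusion model, and the student is a one-step linear map trained by plain regression onto the teacher's final output.

```python
import random


def teacher_multistep(x, num_steps=8, step_size=0.5):
    """Toy stand-in for multi-step denoising (an assumption, not the paper's model).

    Starting from a fixed initial value, repeatedly move the estimate z
    toward a hypothetical clean representation of the input x.
    """
    clean = 2.0 * x + 1.0  # hypothetical clean target for input x
    z = 0.0                # fixed start, used here in place of sampled noise
    for _ in range(num_steps):
        z += step_size * (clean - z)  # one refinement step
    return z


def student_onestep(x, w, b):
    """One-step predictor: a single linear map instead of an iterative loop."""
    return w * x + b


def train_student(num_iters=5000, lr=0.05, seed=0):
    """Fit the one-step student to the multi-step teacher's output with SGD on squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    for _ in range(num_iters):
        x = rng.uniform(-1.0, 1.0)
        target = teacher_multistep(x)            # expensive multi-step result
        err = student_onestep(x, w, b) - target  # distillation residual
        w -= lr * err * x                        # gradient step on 0.5 * err**2
        b -= lr * err
    return w, b


if __name__ == "__main__":
    w, b = train_student()
    x = 0.3
    print("teacher (8 steps):", teacher_multistep(x))
    print("student (1 step): ", student_onestep(x, w, b))
```

After training, the student reproduces the teacher's eight-step result with a single evaluation. In this toy the match is essentially exact because the teacher's output is itself linear in `x`; a real model would only approximate its multi-step target.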

Taxonomy

Topics: Robot Manipulation and Learning · Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition