Learning Physics from Pretrained Video Models: A Multimodal Continuous and Sequential World Interaction Models for Robotic Manipulation
Zijian Song, Qichang Li, Sihan Qin, Yuhao Chen, Tianshui Chen, Liang Lin, Guangrun Wang

TL;DR
PhysGen leverages pretrained video generation models as proxies for physics simulators, enabling improved robotic manipulation by transferring implicit physical knowledge from videos to control tasks.
Contribution
This work introduces PhysGen, a novel framework that unifies video and action representations for robotic control, utilizing pretrained video models to transfer physical understanding.
Findings
PhysGen outperforms baselines on Libero and ManiSkill benchmarks.
PhysGen matches large-scale action-pretrained models in real-world tasks.
The approach effectively transfers physical knowledge like object permanence from videos.
Abstract
The scarcity of large-scale robotic data has motivated the repurposing of foundation models from other modalities for policy learning. In this work, we introduce PhysGen (Learning Physics from Pretrained Video Generation Models), a scalable continuous and sequential world interaction framework that leverages autoregressive video generation to solve robotic manipulation tasks. By treating the pretrained video model as a proxy for a physics simulator, PhysGen models the dynamic interplay between the external environment and robot actions. We introduce a multimodal continuous representation that unifies video and action into shared physical tokens, bridging the gap between discrete video generation and continuous robotic control. This approach enables the seamless transfer of implicit physical knowledge, such as object permanence and dynamics, from video pretraining to downstream…
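The idea of unifying video and action into one stream of continuous tokens can be illustrated with a toy sketch. This is not the authors' implementation: the interleaving order, the linear projections, the dimensions, and every name below (`interleave_tokens`, `TOKEN_DIM`, and so on) are hypothetical choices made only to show what a shared continuous token sequence might look like.

```python
import random

def linear(vec, weight):
    """Project a vector of length n with a weight matrix of m rows (each length n)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]

def make_weight(out_dim, in_dim, rng):
    """Random projection matrix; a stand-in for learned parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def interleave_tokens(frame_latents, actions, w_frame, w_action):
    """Build one sequence [video_0, action_0, video_1, action_1, ...].

    Both modalities are mapped into the same continuous token space, so a
    single autoregressive model could consume them. The interleaving order
    is an illustrative assumption, not taken from the paper.
    """
    if len(frame_latents) != len(actions):
        raise ValueError("expected one action per frame latent")
    sequence = []
    for frame, action in zip(frame_latents, actions):
        sequence.append(("video", linear(frame, w_frame)))
        sequence.append(("action", linear(action, w_action)))
    return sequence

# Hypothetical sizes: 16-d frame latents, 7-d robot actions, 8-d shared tokens.
rng = random.Random(0)
TOKEN_DIM, FRAME_DIM, ACTION_DIM = 8, 16, 7
w_frame = make_weight(TOKEN_DIM, FRAME_DIM, rng)
w_action = make_weight(TOKEN_DIM, ACTION_DIM, rng)

frames = [[rng.random() for _ in range(FRAME_DIM)] for _ in range(3)]
actions = [[rng.random() for _ in range(ACTION_DIM)] for _ in range(3)]
tokens = interleave_tokens(frames, actions, w_frame, w_action)
```

In this sketch the tokens stay real-valued vectors rather than codebook indices, which is the sense in which the representation is continuous; how PhysGen actually encodes frames and actions is not specified in the text above.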
