MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
Shubo Lin, Xuanyang Zhang, Wei Cheng, Weiming Hu, Gang Yu, Jin Gao

TL;DR
MMPhysVideo is a joint multimodal framework that pairs a teacher-student architecture with a data curation pipeline (MMPhysPipe) to improve physical plausibility in video generation. The authors report state-of-the-art results on multiple benchmarks.
Contribution
The paper presents MMPhysVideo, which the authors describe as the first framework to scale physical plausibility in video diffusion models through joint multimodal modeling, together with a new data curation pipeline.
Findings
Improves physical plausibility and visual quality in video generation.
Achieves state-of-the-art performance on multiple benchmarks.
Introduces MMPhysPipe for scalable multimodal dataset construction.
Abstract
Despite advancements in generating visually stunning content, video diffusion models (VDMs) often yield physically inconsistent results due to pixel-only reconstruction. To address this, we propose MMPhysVideo, the first framework to scale physical plausibility in video generation through joint multimodal modeling. We recast perceptual cues, specifically semantics, geometry, and spatio-temporal trajectory, into a unified pseudo-RGB format, enabling VDMs to directly capture complex physical dynamics. To mitigate cross-modal interference, we propose a Bidirectionally Controlled Teacher architecture, which utilizes parallel branches to fully decouple RGB and perception processing and adopts two zero-initialized control links to gradually learn pixel-wise consistency. For inference efficiency, the teacher's physical prior is distilled into a single-stream student model via representation…
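The zero-initialized control links mentioned in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `TwoBranchBlock`, its weight names, and the use of plain linear maps in place of real network blocks are all assumptions made for illustration. It shows only the general idea that two parallel branches exchange information through links whose weights start at zero, so each branch initially behaves as if the other did not exist.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def add(a, b):
    """Element-wise sum of two vectors."""
    return [ai + bi for ai, bi in zip(a, b)]


class TwoBranchBlock:
    """Hypothetical toy block: an RGB branch and a perception branch,
    coupled by two control links that are zero at initialization."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)
        rand = lambda: [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
        zeros = lambda: [[0.0] * dim for _ in range(dim)]
        self.W_rgb = rand()          # stand-in for the RGB branch
        self.W_per = rand()          # stand-in for the perception branch
        self.L_per_to_rgb = zeros()  # control link: perception -> RGB
        self.L_rgb_to_per = zeros()  # control link: RGB -> perception

    def forward(self, rgb, per):
        # Each branch processes its own input independently.
        h_rgb = matvec(self.W_rgb, rgb)
        h_per = matvec(self.W_per, per)
        # Each link adds a projection of the other branch's features.
        # With zero weights this contributes nothing.
        out_rgb = add(h_rgb, matvec(self.L_per_to_rgb, h_per))
        out_per = add(h_per, matvec(self.L_rgb_to_per, h_rgb))
        return out_rgb, out_per


block = TwoBranchBlock(dim=4)
rgb = [1.0, 2.0, 3.0, 4.0]
per = [0.5, -0.5, 0.25, -0.25]
out_rgb, out_per = block.forward(rgb, per)
```

At initialization `out_rgb` equals the RGB branch's own output, so the branches start fully decoupled. Cross-modal influence appears only once training moves the link weights away from zero, which is one plausible reading of "gradually learn" in the abstract.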
