PhysVideoGenerator: Towards Physically Aware Video Generation via Latent Physics Guidance

Siddarth Nilol Kundur Satish; Devesh Jaiswal; Hongyu Chen; Abhishek Bakshi

arXiv:2601.03665 · cs.CV · January 8, 2026

TL;DR

PhysVideoGenerator is a proof-of-concept, physics-aware video generation framework that embeds a learnable physics prior into the diffusion process, with the goal of making generated videos better respect real-world physical dynamics.

Contribution

This work demonstrates the feasibility of integrating a learnable physics prior into diffusion-based video generation, a novel approach in physics-aware generative modeling.

Findings

1. Diffusion latents contain sufficient information to recover physical representations.

2. Joint training of the physics prior and the video generator remains stable.

3. The framework establishes a foundation for future evaluation of physics-aware video synthesis.

Abstract

Current video generation models produce high-quality aesthetic videos but often struggle to learn representations of real-world physics dynamics, resulting in artifacts such as unnatural object collisions, inconsistent gravity, and temporal flickering. In this work, we propose PhysVideoGenerator, a proof-of-concept framework that explicitly embeds a learnable physics prior into the video generation process. We introduce a lightweight predictor network, PredictorP, which regresses high-level physical features extracted from a pre-trained Video Joint Embedding Predictive Architecture (V-JEPA 2) directly from noisy diffusion latents. These predicted physics tokens are injected into the temporal attention layers of a DiT-based generator (Latte) via a dedicated cross-attention mechanism. Our primary contribution is demonstrating the technical feasibility of this joint training paradigm: we…
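The two mechanisms the abstract names, regressing physics features from noisy latents and injecting the predicted tokens through cross-attention, can be illustrated with a toy NumPy sketch. This is not the authors' implementation: `ToyPredictor`, `physics_regression_loss`, `cross_attend`, the linear predictor, the mean-squared-error loss, the single attention head, and all dimensions are assumptions made for illustration, and random arrays stand in for real diffusion latents and V-JEPA 2 features.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class ToyPredictor:
    """Hypothetical stand-in for PredictorP: a per-token linear map
    from noisy latent tokens to physics tokens (assumed form)."""
    def __init__(self, latent_dim, physics_dim, rng):
        self.W = rng.normal(0.0, 0.02, size=(latent_dim, physics_dim))
        self.b = np.zeros(physics_dim)

    def __call__(self, noisy_latents):
        return noisy_latents @ self.W + self.b

def physics_regression_loss(pred, target):
    # Assumed objective: mean squared error against frozen target features.
    return float(np.mean((pred - target) ** 2))

def cross_attend(video_tokens, physics_tokens, Wq, Wk, Wv):
    # Queries come from video tokens; keys and values from physics tokens.
    q = video_tokens @ Wq
    k = physics_tokens @ Wk
    v = physics_tokens @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    # Residual connection: physics information is added to the video stream.
    return video_tokens + attn @ v, attn

# Illustrative sizes only (not taken from the paper).
num_tokens, latent_dim, physics_dim, head_dim = 6, 16, 8, 4

noisy_latents = rng.normal(size=(num_tokens, latent_dim))
# Placeholder for features from a frozen, pre-trained physics encoder.
target_features = rng.normal(size=(num_tokens, physics_dim))

predictor = ToyPredictor(latent_dim, physics_dim, rng)
physics_tokens = predictor(noisy_latents)
loss = physics_regression_loss(physics_tokens, target_features)

Wq = rng.normal(0.0, 0.1, size=(latent_dim, head_dim))
Wk = rng.normal(0.0, 0.1, size=(physics_dim, head_dim))
Wv = rng.normal(0.0, 0.1, size=(physics_dim, latent_dim))
conditioned, attn = cross_attend(noisy_latents, physics_tokens, Wq, Wk, Wv)
```

In this sketch the conditioned tokens keep the shape of the video tokens, so such a block could in principle sit alongside an existing temporal attention layer without changing the surrounding dimensions. In a joint training setup the regression loss would presumably be combined with the usual diffusion loss, though the weighting is not stated in the text above.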

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Music Technology and Sound Studies · Human Motion and Animation