Inference-time Physics Alignment of Video Generative Models with Latent World Models

Jianhao Yuan; Xiaofeng Zhang; Felix Friedrich; Nicolas Beltran-Velez; Melissa Hall; Reyhane Askari-Hemmat; Xiaochuang Han; Nicolas Ballas; Michal Drozdzal; Adriana Romero-Soriano

arXiv:2601.10553 · cs.CV · March 2, 2026

TL;DR

This paper introduces an inference-time alignment method that uses a latent world model as a reward to improve the physics plausibility of video generative models, reaching 62.64% on the PhysicsIQ Challenge.

Contribution

The paper proposes WMReward, an inference-time strategy that uses a latent world model (here, VJEPA-2) as a reward to search over and steer candidate denoising trajectories, improving physics consistency in video generation.

Findings

01 Physics plausibility improved substantially across image-conditioned, multiframe-conditioned, and text-conditioned generation settings.

02 The method achieved 62.64% in the PhysicsIQ Challenge, outperforming previous methods by 7.42%.

03 The improvements were validated through human preference studies.

Abstract

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding from pre-training, we find that the shortfall in physics plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, VJEPA-2) as a reward to search and steer multiple candidate denoising trajectories, enabling scaling test-time compute for better generation performance. Empirically, our approach substantially improves physics plausibility across image-conditioned, multiframe-conditioned, and text-conditioned generation settings, with validation from human…
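The search half of the idea in the abstract (score several candidate generations with a reward and keep the best) can be illustrated with a minimal best-of-N sketch. Everything here is an assumption for illustration: `toy_sampler` stands in for a video generator, `toy_reward` stands in for a world-model reward by penalizing how badly a simple constant-velocity predictor extrapolates each frame, and `best_of_n` is a hypothetical helper name. It does not use VJEPA-2, and it does not implement the steering of denoising trajectories that the paper also describes.

```python
import random
from typing import Callable, List, Tuple

# Toy stand-in for a video: a 1-D trajectory of object positions, one per frame.
Video = List[float]


def toy_sampler(rng: random.Random, num_frames: int = 8) -> Video:
    """Hypothetical generator: constant-velocity motion plus random jitter."""
    jitter = rng.uniform(0.0, 1.0)  # each candidate gets its own noise level
    return [t + rng.gauss(0.0, jitter) for t in range(num_frames)]


def toy_reward(video: Video) -> float:
    """Hypothetical reward: negative squared error of a constant-velocity
    predictor that extrapolates each frame from the two frames before it.
    A higher (closer to zero) value means more predictable motion."""
    error = 0.0
    for t in range(2, len(video)):
        predicted = 2 * video[t - 1] - video[t - 2]
        error += (video[t] - predicted) ** 2
    return -error


def best_of_n(
    sample: Callable[[random.Random], Video],
    reward: Callable[[Video], float],
    n: int,
    seed: int = 0,
) -> Tuple[Video, float]:
    """Draw n candidates, score each one, and return the highest-scoring
    candidate together with its score."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    scores = [reward(video) for video in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]


if __name__ == "__main__":
    for n in (1, 4, 16):
        _, score = best_of_n(toy_sampler, toy_reward, n)
        print(f"n={n:2d}  best reward={score:.3f}")
```

With a fixed seed, the best reward can only stay the same or improve as `n` grows, because the larger candidate set contains the smaller one. This is the sense in which spending more test-time compute can buy a better generation under a fixed reward.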

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition