PhysAlign: Physics-Coherent Image-to-Video Generation through Feature and 3D Representation Alignment

Zhexiao Xiong; Yizhi Song; Liu He; Wei Xiong; Yu Yuan; Feng Qiao; Nathan Jacobs

arXiv:2603.13770 · cs.CV · March 17, 2026

TL;DR

PhysAlign introduces a physics-coherent image-to-video generation framework that uses synthetic data and explicit 3D constraints to produce temporally stable videos aligned with physical laws.

Contribution

PhysAlign presents a physics-grounded approach to image-to-video generation, using a synthetic dataset built from rigid-body simulation and a unified physical latent space to improve temporal coherence.

Findings

01 Outperforms existing models in physical reasoning tasks

02 Achieves higher temporal stability without sacrificing visual quality

03 Bridges the gap between visual synthesis and physical kinematics

Abstract

Video Diffusion Models (VDMs) offer a promising approach for simulating dynamic scenes and environments, with broad applications in robotics and media generation. However, existing models often generate temporally incoherent content that violates basic physical intuition, significantly limiting their practical applicability. We propose PhysAlign, an efficient framework for physics-coherent image-to-video (I2V) generation that explicitly addresses this limitation. To overcome the critical scarcity of physics-annotated videos, we first construct a fully controllable synthetic data generation pipeline based on rigid-body simulation, yielding a highly curated dataset with accurate, fine-grained physics and 3D annotations. Leveraging this data, PhysAlign constructs a unified physical latent space by coupling explicit 3D geometry constraints with a Gram-based spatio-temporal relational…
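
The abstract is cut off before it describes the Gram-based component, so the exact formulation is not available here. As a rough illustration of what a Gram-based relational alignment generally looks like, the sketch below compares the pairwise cosine-similarity (Gram) matrices of two sets of spatio-temporal feature tokens. All function names, the cosine normalization, and the mean-squared-error objective are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def flatten_video(features):
    """Flatten features indexed as [frame][patch][channel] into one token list.

    Pooling tokens across frames means the Gram matrix captures temporal
    as well as spatial relations (an assumed design, for illustration only).
    """
    return [patch for frame in features for patch in frame]


def gram_matrix(tokens):
    """Pairwise cosine-similarity matrix between all tokens."""
    unit = [l2_normalize(t) for t in tokens]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def gram_alignment_loss(student_tokens, teacher_tokens):
    """Mean squared difference between the two Gram matrices (hypothetical loss)."""
    g_student = gram_matrix(student_tokens)
    g_teacher = gram_matrix(teacher_tokens)
    n = len(g_student)
    total = sum(
        (g_student[i][j] - g_teacher[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


# Toy example: 2 frames, 2 patches per frame, 2 channels.
teacher = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, 0.0]]]
# Same directions, different magnitudes: relations between tokens are unchanged.
rescaled = [[[3.0 * c for c in patch] for patch in frame] for frame in teacher]
# Every token identical: all relational structure is lost.
collapsed = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]

loss_rescaled = gram_alignment_loss(flatten_video(rescaled), flatten_video(teacher))
loss_collapsed = gram_alignment_loss(flatten_video(collapsed), flatten_video(teacher))
```

Because the loss compares token-to-token relations and not raw features, the two feature sets only need the same number of tokens; their channel widths may differ. In this toy setup `loss_rescaled` is zero up to floating-point error, while `loss_collapsed` is strictly positive.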

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · 3D Shape Modeling and Analysis