TL;DR
This paper introduces PhysHPO, a hierarchical preference-optimization framework for physically plausible video generation. It aligns generated videos at four granularities (instance, state, motion, and semantic) and uses an automated data selection pipeline to improve realism.
Contribution
It presents the first approach to fine-grained preference alignment and automated data selection for physically plausible video generation.
Findings
Significant improvement in physical plausibility of generated videos.
Enhanced overall video quality and realism.
Effective hierarchical optimization across four granularities (instance, state, motion, and semantic).
Abstract
Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos…
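To make the idea of preference optimization at several granularities concrete, here is a minimal sketch that combines the standard DPO loss over the four levels named in the abstract. The abstract does not give the paper's per-level objectives, so everything here is an assumption for illustration: the function names `dpo_loss` and `hierarchical_dpo_loss`, the weighted-sum combination, the uniform default weights, and the use of one scalar log-probability per level are all hypothetical.

```python
import math

# Level names taken from the abstract; how each level's log-probabilities
# are computed is NOT specified there and is left to the caller.
LEVELS = ("instance", "state", "motion", "semantic")


def dpo_loss(policy_win, policy_lose, ref_win, ref_lose, beta=0.1):
    """Standard DPO loss for one preference pair, given log-probabilities.

    Computes -log(sigmoid(beta * ((policy_win - ref_win)
                                  - (policy_lose - ref_lose)))).
    """
    margin = beta * ((policy_win - ref_win) - (policy_lose - ref_lose))
    # -log(sigmoid(m)) == softplus(-m); this form avoids overflow in exp().
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hierarchical_dpo_loss(pairs, weights=None, beta=0.1):
    """Weighted sum of per-level DPO losses (assumed combination rule).

    pairs maps each level name to a tuple
    (policy_win, policy_lose, ref_win, ref_lose) of log-probabilities.
    """
    if weights is None:
        weights = {level: 1.0 for level in LEVELS}  # assumed uniform weights
    return sum(
        weights[level] * dpo_loss(*pairs[level], beta=beta)
        for level in LEVELS
    )
```

With identical policy and reference scores every level contributes log 2, and the loss drops as the policy assigns relatively more probability to the preferred (more physically plausible) video at any level.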