Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation

Harold Haodong Chen; Haojian Huang; Qifeng Chen; Harry Yang; Ser-Nam Lim

arXiv:2508.10858 · cs.CV · August 15, 2025

TL;DR

This paper introduces PhysHPO, a hierarchical direct preference optimization framework for physically plausible video generation. It aligns generated videos at four granularities (instance, state, motion, and semantic) and uses an automated data selection pipeline to improve realism.

Contribution

The authors present what they describe as the first approach to fine-grained preference alignment and automated data selection for physically plausible video generation.

Findings

01 Significant improvement in physical plausibility of generated videos.

02 Enhanced overall video quality and realism.

03 Effective hierarchical optimization across multiple granularities.

Abstract

Recent advancements in video generation have enabled the creation of high-quality, visually compelling videos. However, generating videos that adhere to the laws of physics remains a critical challenge for applications requiring realism and accuracy. In this work, we propose PhysHPO, a novel framework for Hierarchical Cross-Modal Direct Preference Optimization, to tackle this challenge by enabling fine-grained preference alignment for physically plausible video generation. PhysHPO optimizes video alignment across four hierarchical granularities: a) Instance Level, aligning the overall video content with the input prompt; b) State Level, ensuring temporal consistency using boundary frames as anchors; c) Motion Level, modeling motion trajectories for realistic dynamics; and d) Semantic Level, maintaining logical consistency between narrative and visuals. Recognizing that real-world videos…
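
To make the four-level structure concrete, here is a minimal sketch of how per-level preference losses might be combined. It is not the paper's objective: it uses the generic DPO loss on scalar log-probabilities, and the weighted sum, the `weights` argument, the `beta` value, and the function names `dpo_loss` and `hierarchical_loss` are all assumptions made for illustration. Only the four level names come from the abstract.

```python
import math

# Level names come from the abstract; everything else here is an assumption.
LEVELS = ("instance", "state", "motion", "semantic")


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Generic DPO loss for one (preferred w, rejected l) pair."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def hierarchical_loss(pairs, weights=None, beta=0.1):
    """Weighted sum of per-level DPO losses (hypothetical combination rule).

    pairs maps each level name to (logp_w, logp_l, ref_logp_w, ref_logp_l).
    """
    weights = weights or {level: 1.0 for level in LEVELS}
    return sum(weights[level] * dpo_loss(*pairs[level], beta=beta) for level in LEVELS)


# Usage with made-up log-probabilities, identical at every level.
pairs = {level: (-10.0, -12.0, -11.0, -11.0) for level in LEVELS}
print(hierarchical_loss(pairs))
```

With equal weights, each level contributes equally. A real system would likely compute level-specific scores from different signals (for example boundary frames at the state level), which this sketch does not attempt to model.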

Videos

Hierarchical Fine-grained Preference Optimization for Physically Plausible Video Generation · SlidesLive