Physics-Guided Motion Loss for Video Generation Model

Bowen Xue; Giuseppe Claudio Guarnera; Shuang Zhao; Zahra Montazeri

arXiv:2506.02244·cs.CV·September 29, 2025

3 Reviews

TL;DR

This paper introduces a frequency-domain physics prior that enhances motion plausibility in video diffusion models by decomposing rigid motions into spectral losses, leading to more realistic and consistent video generation without altering model architectures.

Contribution

The authors propose a novel spectral loss method for enforcing physical motion constraints in video diffusion models, improving motion accuracy and realism.
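One way to see why a spectral loss can encode a rigid-motion constraint is the Fourier shift theorem: a pure (circular) translation changes only the phase of the 2D spectrum, so the magnitude spectrum is unchanged. The sketch below is illustrative only and is not the authors' loss; the function names `magnitude_spectrum` and `translation_spectral_loss`, the mean-squared form of the penalty, and the synthetic frames are all assumptions made for this example.

```python
import numpy as np

def magnitude_spectrum(frame):
    """Return |FFT2(frame)|, which is invariant to circular translation."""
    return np.abs(np.fft.fft2(frame))

def translation_spectral_loss(frame_a, frame_b):
    """Hypothetical penalty: mean squared difference of magnitude spectra."""
    diff = magnitude_spectrum(frame_a) - magnitude_spectrum(frame_b)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
frame = rng.random((32, 32))

# Rigid motion: circular shift by 3 rows and 5 columns.
shifted = np.roll(frame, shift=(3, 5), axis=(0, 1))

# Non-rigid change: a column-wise intensity ramp, which is not a translation.
ramped = frame * np.linspace(0.5, 1.5, 32)[None, :]

loss_rigid = translation_spectral_loss(frame, shifted)
loss_nonrigid = translation_spectral_loss(frame, ramped)

print(f"rigid shift loss:      {loss_rigid:.3e}")
print(f"non-rigid change loss: {loss_nonrigid:.3e}")
```

The translated frame incurs a loss at the level of floating-point error, while the non-rigid change is penalized. Real video frames are not circularly shifted, so boundary effects would make the invariance only approximate in practice.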

Findings

01

Improves both motion accuracy and action recognition by ~11% on average (relative) on OpenVID-1M.

02

Reduces warping error by 22--37%, depending on backbone.

03

User preference for physics-enhanced videos is 74--83%.

Abstract

Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubber-sheet deformations and inconsistent object motion. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. Our method decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses, requiring only 2.7% of frequency coefficients while preserving 97%+ of spectral energy. Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition by ~11% on average on OpenVID-1M (relative), while maintaining visual quality. User studies show 74--83% preference for our physics-enhanced videos. It also reduces warping error by 22--37% (depending on the backbone) and improves temporal consistency scores. These results…
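The claim that a small share of frequency coefficients can carry most of the spectral energy can be checked on any clip with a few lines of NumPy. The sketch below is an assumption-laden illustration, not the paper's procedure: it selects coefficients by largest magnitude (the abstract does not state the selection rule), and `retained_energy_fraction`, `moving_blob_video`, and the synthetic moving-blob clip are hypothetical names and data introduced here.

```python
import numpy as np

def retained_energy_fraction(video, keep_fraction):
    """Fraction of spectral energy held by the largest-magnitude coefficients.

    video: array of shape (frames, height, width).
    keep_fraction: share of coefficients to keep, e.g. 0.027 for 2.7%.
    """
    spectrum = np.fft.fftn(video)
    energy = np.abs(spectrum).ravel() ** 2
    k = max(1, int(round(keep_fraction * energy.size)))
    # Partition so that the k largest energies sit at the end of the array.
    top_k = np.partition(energy, energy.size - k)[energy.size - k:]
    return float(top_k.sum() / energy.sum())

def moving_blob_video(frames=8, size=32, sigma=4.0):
    """Synthetic clip: a Gaussian blob translating one pixel per frame."""
    ys, xs = np.mgrid[0:size, 0:size]
    blob = np.exp(-((ys - size / 2) ** 2 + (xs - size / 2) ** 2) / (2 * sigma ** 2))
    return np.stack([np.roll(blob, t, axis=1) for t in range(frames)])

video = moving_blob_video()
frac_small = retained_energy_fraction(video, 0.027)
frac_tenth = retained_energy_fraction(video, 0.1)
frac_all = retained_energy_fraction(video, 1.0)

print(f"energy kept with 2.7% of coefficients: {frac_small:.4f}")
print(f"energy kept with 10% of coefficients:  {frac_tenth:.4f}")
```

The value obtained on this smooth synthetic clip says nothing about the 97%+ figure reported for real videos; the function is only a way to measure the same quantity on data of one's own.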

Peer Reviews

Decision·ICLR 2026 Conference Desk Rejected Submission

Reviewer 01·Rating 8·Confidence 4

Strengths

The paper introduces a novel frequency-domain regularization for video diffusion models that leverages spectral signatures of translation, rotation, and scaling to guide learning without altering model architecture. The strengths are:
- Combining classical ideas from Fourier analysis and the SIM(2) motion group with modern video diffusion models demonstrates a creative synthesis of physics-based priors and deep generative modeling.
- The authors provide a thorough derivation conn

Weaknesses

- Although the theory is solid, as a paper in the video generation field, its presentation lacks some intuitive visualizations, such as visual demonstrations of spectral changes, and the qualitative evaluation is relatively limited;
- In the Abstract, “regularizer” is written as “regular- izer,” which looks like a copy-paste error;
- On the first page, in the “four groups” listing, why is only (i) bolded?
- As an important demonstration, the supplementary video is of poor aesthetic quality an

Reviewer 02·Rating 8·Confidence 3

Strengths

It is novel and interesting to regularize basic global physical motions in the frequency domain, a simple yet effective approach that can be easily integrated into any video generation model.

Weaknesses

1. The applicable physics motion patterns are limited to rotation, translation and scaling.
2. My understanding is that the method is primarily effective for videos containing a single dominant motion and cannot handle scenarios involving multiple objects moving differently or simultaneous camera and object motion.

Reviewer 03·Rating 6·Confidence 2

Strengths

- This paper explores an important problem in video generation.
- The SIM(2)-based spectral derivation unifies translation, rotation, and scaling within a mathematically sound framework.
- The loss is architecture-agnostic and can be inserted into any diffusion model without modifying the backbone.
- Evaluation spans three major video diffusion systems and includes multiple metrics.

Weaknesses

- The innovation lies mainly in unifying them under the SIM(2) formulation.
- The method only addresses translation, rotation, and scaling, which limits applicability to real-world complex scenes.
- There are some related physics-constrained video generation works, such as [a], which should also be discussed. Beyond comparing with the baseline models, the paper should also compare with related works.
- [b] is a comprehensive physics generation benchmark designed to evaluate physical commonsen
