TL;DR
This paper introduces a frequency-domain physics prior that enhances motion plausibility in video diffusion models by decomposing rigid motions into spectral losses, leading to more realistic and consistent video generation without altering model architectures.
Contribution
The authors propose a novel spectral loss method for enforcing physical motion constraints in video diffusion models, improving motion accuracy and realism.
Findings
Improves motion accuracy and action recognition by ~11% on average (relative) on OpenVID-1M across Open-Sora, MVDIT, and Hunyuan.
Reduces warping error by 22--37%, depending on backbone.
User studies show a 74--83% preference for the physics-enhanced videos.
Abstract
Current video diffusion models generate visually compelling content but often violate basic laws of physics, producing subtle artifacts like rubber-sheet deformations and inconsistent object motion. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. Our method decomposes common rigid motions (translation, rotation, scaling) into lightweight spectral losses, requiring only 2.7% of frequency coefficients while preserving 97%+ of spectral energy. Applied to Open-Sora, MVDIT, and Hunyuan, our approach improves both motion accuracy and action recognition by ~11% on average on OpenVID-1M (relative), while maintaining visual quality. User studies show 74--83% preference for our physics-enhanced videos. It also reduces warping error by 22--37% (depending on the backbone) and improves temporal consistency scores. These results…
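The spectral view of translation mentioned in the abstract rests on the Fourier shift theorem: a (circular) translation leaves the magnitude spectrum unchanged and multiplies each coefficient by a linear phase ramp. The sketch below checks that property with NumPy and also measures how much spectral energy a small low-frequency subset of coefficients retains. It is an illustration under stated assumptions, not the authors' loss: the function names, the circular shift, and the radial "closest to DC" selection rule are all hypothetical choices made here, and the 2.7% / 97% figures in the abstract depend on the authors' data and selection scheme, which this page does not describe.

```python
import numpy as np


def translation_spectral_check(frame, shift):
    """Check the Fourier shift theorem for a circular translation.

    Returns (max magnitude difference, max deviation from the predicted
    linear-phase spectrum). Both should be near zero.
    """
    dy, dx = shift
    # Assumption: circular (wrap-around) shift, so the theorem holds exactly.
    moved = np.roll(frame, shift=(dy, dx), axis=(0, 1))

    f_orig = np.fft.fft2(frame)
    f_moved = np.fft.fft2(moved)

    h, w = frame.shape
    ky = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles per pixel
    kx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, cycles per pixel

    # Predicted spectrum: same magnitude, phase ramp linear in (ky, kx).
    predicted = f_orig * np.exp(-2j * np.pi * (ky * dy + kx * dx))

    mag_err = np.max(np.abs(np.abs(f_moved) - np.abs(f_orig)))
    phase_err = np.max(np.abs(f_moved - predicted))
    return mag_err, phase_err


def low_frequency_energy_fraction(frame, keep_fraction):
    """Fraction of spectral energy held by the coefficients closest to DC.

    Hypothetical selection rule: keep roughly `keep_fraction` of all
    coefficients, ranked by radial distance from the zero frequency.
    Returns (energy fraction retained, fraction of coefficients kept).
    """
    spectrum = np.fft.fftshift(np.fft.fft2(frame))  # DC at (h // 2, w // 2)
    h, w = frame.shape
    yy, xx = np.mgrid[:h, :w]
    radius = np.hypot(yy - h // 2, xx - w // 2)

    n_keep = max(1, int(round(keep_fraction * h * w)))
    threshold = np.sort(radius.ravel())[n_keep - 1]
    mask = radius <= threshold  # ties at the threshold are all kept

    energy = np.abs(spectrum) ** 2
    return energy[mask].sum() / energy.sum(), mask.mean()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal((64, 64))
    print(translation_spectral_check(frame, (3, -5)))
    print(low_frequency_energy_fraction(frame, 0.027))
```

A loss built on this idea could compare the phase difference between consecutive frames against a linear ramp, restricted to the retained coefficients. Rotation and scaling would need a different parameterization of the spectrum, which this sketch does not attempt.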
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
The paper introduces a novel frequency-domain regularization for video diffusion models that leverages spectral signatures of translation, rotation, and scaling to guide learning without altering model architecture. The strengths are:
- The idea of combining classical ideas from Fourier analysis and the SIM(2) motion group with modern video diffusion models, demonstrating a creative synthesis of physics-based priors and deep generative modeling.
- The authors provide a thorough derivation conn
- Although the theory is solid, as a paper in the video generation field, its presentation lacks some intuitive visualizations, such as visual demonstrations of spectral changes, and the qualitative evaluation is relatively limited;
- In the Abstract, “regularizer” is written as “regular- izer,” which looks like a copy-paste error;
- On the first page, in the “four groups” listing, why is only (i) bolded?
- As an important demonstration, the supplementary video is of poor aesthetic quality an
It is novel and interesting to regularize basic global physical motions in the frequency domain, a simple yet effective approach that can be easily integrated into any video generation model.
1. The applicable physics motion patterns are limited to rotation, translation and scaling.
2. My understanding is that the method is primarily effective for videos containing a single dominant motion and cannot handle scenarios involving multiple objects moving differently or simultaneous camera and object motion.
- This paper explores an important problem in video generation.
- The SIM(2)-based spectral derivation unifies translation, rotation, and scaling within a mathematically sound framework.
- The loss is architecture-agnostic and can be inserted into any diffusion model without modifying the backbone.
- Evaluation spans three major video diffusion systems and includes multiple metrics.
- The innovation lies mainly in unifying them under the SIM(2) formulation.
- The method only addresses translation, rotation, and scaling, which limits applicability to real-world complex scenes.
- There are some related physics-constrained video generation works, such as [a], which should also be discussed. Also, beyond comparing with the baseline models, the paper should compare with some related works.
- [b] is a comprehensive physics generation benchmark designed to evaluate physical commonsen
