MoAlign: Motion-Centric Representation Alignment for Video Diffusion Models
Aritra Bhowmik, Denis Korzhenkov, Cees G. M. Snoek, Amirhossein Habibian, Mohsen Ghafoorian

TL;DR
This paper introduces MoAlign, a motion-centric alignment framework that disentangles and learns true motion dynamics from pretrained video encoders, improving the temporal coherence and physical plausibility of generated videos in diffusion models.
Contribution
MoAlign proposes a novel disentangled motion subspace learned from pretrained encoders, enhancing motion understanding in text-to-video diffusion models for more realistic video synthesis.
Findings
Improves physical commonsense in video generation.
Enhances temporal coherence and motion plausibility.
Outperforms previous methods on multiple benchmarks.
Abstract
Text-to-video diffusion models have enabled high-quality video synthesis, yet often fail to generate temporally coherent and physically plausible motion. A key reason is the models' insufficient understanding of complex motions that natural videos often entail. Recent works tackle this problem by aligning diffusion model features with those from pretrained video encoders. However, these encoders mix video appearance and dynamics into entangled features, limiting the benefit of such alignment. In this paper, we propose a motion-centric alignment framework that learns a disentangled motion subspace from a pretrained video encoder. This subspace is optimized to predict ground-truth optical flow, ensuring it captures true motion dynamics. We then align the latent features of a text-to-video diffusion model to this new subspace, enabling the generative model to internalize motion knowledge…
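The two stages described in the abstract can be pictured with a minimal pure-Python sketch. Everything concrete in it is an assumption made for illustration and is not taken from the paper: the linear projections `W_motion`, `W_flow`, and `W_align`, the endpoint-error flow loss, and the cosine-similarity alignment loss are hypothetical stand-ins, since the abstract does not specify the actual architecture or objectives.

```python
import math


def matvec(W, x):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth 2D flow vectors."""
    total = sum(math.hypot(p[0] - g[0], p[1] - g[1])
                for p, g in zip(pred_flow, gt_flow))
    return total / len(gt_flow)


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)


def stage1_flow_loss(encoder_feats, W_motion, W_flow, gt_flow):
    """Stage 1 (assumed form): project frozen encoder features into a
    motion subspace, decode optical flow from it, and score the result
    against ground-truth flow. Returns (loss, motion features)."""
    motion_feats = [matvec(W_motion, f) for f in encoder_feats]
    pred_flow = [matvec(W_flow, m) for m in motion_feats]
    return endpoint_error(pred_flow, gt_flow), motion_feats


def stage2_alignment_loss(diffusion_feats, W_align, motion_feats):
    """Stage 2 (assumed form): one minus the mean cosine similarity between
    projected diffusion features and the fixed motion-subspace targets."""
    projected = [matvec(W_align, h) for h in diffusion_feats]
    sims = [cosine_similarity(p, m) for p, m in zip(projected, motion_feats)]
    return 1.0 - sum(sims) / len(sims)


# Toy usage: two tokens, 3-dim encoder features, 2-dim motion subspace.
encoder_feats = [[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]]
W_motion = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # hypothetical projection
W_flow = [[1.0, 0.0], [0.0, 1.0]]               # hypothetical flow head
gt_flow = [[1.0, 0.0], [0.0, 1.0]]

flow_loss, motion_feats = stage1_flow_loss(encoder_feats, W_motion, W_flow, gt_flow)

diffusion_feats = [[2.0, 0.0], [0.0, 3.0]]
W_align = [[1.0, 0.0], [0.0, 1.0]]              # hypothetical alignment head
align_loss = stage2_alignment_loss(diffusion_feats, W_align, motion_feats)
```

In this toy setup the diffusion features point in the same directions as the motion targets, so the alignment loss is close to zero. In a training loop, a term like this would presumably be added to the usual diffusion objective with some weighting, which the abstract also leaves unspecified.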
Peer Reviews
Decision: ICLR 2026 Poster
**1) Relevance and Timeliness** Motion-centric alignment addresses a timely limitation in large video diffusion models, namely the lack of explicit motion understanding, and the approach fits well within the ongoing trend of representation-based fine-tuning. **2) Clear Methodology** The two-stage design (motion feature learning → diffusion feature alignment) is logically presented and easy to follow, providing a clean conceptual link between motion representation learning and diffusion adapta
**1) Literature Coverage** While the introduction claims that prior alignment-based methods mostly focus on visual semantics rather than true motion dynamics, the paper overlooks several recent studies that explicitly target motion-aligned or dynamics-centric representations (e.g., [1], [2]). These works similarly aim to internalize motion dynamics rather than relying on appearance cues. Clarifying how the proposed framework surpasses such motion-aware alignment approaches would help position thi
1. The paper identifies a key limitation of existing representation alignment: appearance and motion are entangled. The proposed two-stage approach is a novel and intuitive solution. 2. Using optical flow as an explicit signal to learn a motion-centric representation space, which is then aligned into the diffusion model, is a logical and strong choice. 3. The paper validates its approach on a wide range of benchmarks.
1. Although the effectiveness of the proposed method is validated on physics-centric benchmarks, it is demonstrated exclusively on CogVideoX-2B. It is unclear if this motion alignment approach would yield similar benefits for other SoTA video diffusion architectures, which may have different internal representations. 2. The claim of improved plausibility is accompanied by a drop in metrics related to motion dynamism. Specifically, MoAlign shows a large reduction in VBench's 'Dynamic Degree' (70.3
- The work addresses an important and timely problem. Although video models can generate visually plausible videos, most lack the capability to generate physically plausible motion. - The two-stage framework is conceptually simple and well-explained. The idea of first learning a motion-specific latent subspace and then aligning the diffusion model's representations to that subspace is innovative. This decoupling of motion from appearance is a neat solution to force the model to i
- The paper's most significant flaw is its failure to cite, discuss, or compare against "VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models" by Chefer et al. (2025). This omission is critical because VideoJAM addresses the exact same problem with a similar philosophy of incorporating an explicit motion signal, but through a fundamentally different mechanism. Both MoAlign and VideoJAM identify that the standard pixel-reconstruction objective is insuff
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Human Pose and Action Recognition
