Moaw: Unleashing Motion Awareness for Video Diffusion Models
Tianqi Zhang, Ziyi Wang, Wenzhao Zheng, Weiliang Chen, Yuanhui Huang, Zhengyang Huang, Jie Zhou, Jiwen Lu

TL;DR
This paper introduces Moaw, a framework that enhances motion awareness in video diffusion models, enabling effective motion transfer and bridging generative modeling with motion understanding.
Contribution
We propose a supervised training approach that shifts video diffusion models from image-to-video generation to video-to-dense-tracking, facilitating zero-shot motion transfer.
Findings
Effective motion transfer without additional adapters
Enhanced motion understanding in diffusion models
Bridging generative modeling and motion perception
Abstract
Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between…
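The feature-selection step described in the abstract (using a motion-labeled dataset to find the features that carry the strongest motion information) can be illustrated with a small sketch. This is not the authors' procedure, whose exact criterion is not given in this excerpt. It assumes features have already been extracted per block, projects them with PCA, and ranks blocks by a simple between-class to within-class scatter ratio. The names `pca_project`, `separability`, `rank_blocks`, the block names, and the synthetic data are all hypothetical.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project rows of `features` onto their top principal components."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def separability(projected, labels):
    """Between-class over within-class scatter (higher = better separated)."""
    overall_mean = projected.mean(axis=0)
    between, within = 0.0, 0.0
    for label in np.unique(labels):
        group = projected[labels == label]
        group_mean = group.mean(axis=0)
        between += len(group) * np.sum((group_mean - overall_mean) ** 2)
        within += np.sum((group - group_mean) ** 2)
    return between / (within + 1e-12)

def rank_blocks(block_features, labels, n_components=2):
    """Return (block_name, score) pairs sorted from most to least separable."""
    scores = {
        name: separability(pca_project(feats, n_components), labels)
        for name, feats in block_features.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Synthetic stand-in for real features: six motion classes, 20 clips each.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(6), 20)
class_centers = rng.normal(size=(6, 32))
block_features = {
    # Hypothetical block with no motion signal (pure noise).
    "early": rng.normal(size=(120, 32)),
    # Hypothetical block whose features cluster by motion class.
    "mid": 5.0 * class_centers[labels] + rng.normal(size=(120, 32)),
}
ranking = rank_blocks(block_features, labels)
```

In this toy setup the block whose features cluster by motion label ranks first; with real features the ranking would depend on the model and the dataset.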
Peer Reviews
Decision: Submitted to ICLR 2026
1. The idea of training a video diffusion model to predict dense trajectories is interesting.
2. The motion-labeled video dataset designed for analyzing motion-aware features is insightful.
3. The paper is well-written and easy to follow.
1. Despite the interesting design, the proposed method achieves worse results than the baselines, as shown in Table 1 and Table 2. This limits the applicability of the proposed method.
2. The paper highlights the point tracking task but lacks one important benchmark, TAP-Vid [1]. How does its tracking performance compare to non-diffusion methods such as CoTracker3 on TAP-Vid?
3. This paper evaluates videos of no more than 48 frames.
[1] TAP-Vid: A Benchmark for Tracking Any Point in a Video
- Using motion-visualization maps (color-mapped trajectories) as input features helps capture coarse motion while reducing appearance dependence.
- Converting trajectories to a colormap and aligning them with the VAE latent space is a clever design choice.
- PCA-based feature analysis is a reasonable attempt at interpretability. The finding that mid-level blocks encode the most discriminative motion information is intuitive and aligns with observations from ControlNet-style architectures.
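To make the colormap idea in the points above concrete, here is a minimal sketch of one way dense trajectories could be turned into an RGB video. The encoding the paper actually uses is not described in this excerpt, so the scheme below is an assumption: each point's 3D displacement from the first frame is normalized per axis and written to the R, G, and B channels. The function name `trajectories_to_colormap` and the `[T, H, W, 3]` layout are hypothetical.

```python
import numpy as np

def trajectories_to_colormap(tracks):
    """Encode dense 3D tracks as an RGB video (assumed scheme, not the paper's).

    tracks: float array [T, H, W, 3] holding the (x, y, z) position at each of
            T frames for every pixel of the first frame.
    Returns a uint8 array [T, H, W, 3] with displacement mapped to color.
    """
    displacement = tracks - tracks[:1]  # motion relative to the first frame
    lo = displacement.min(axis=(0, 1, 2), keepdims=True)
    hi = displacement.max(axis=(0, 1, 2), keepdims=True)
    # Avoid division by zero on axes with no motion at all.
    scale = np.where(hi > lo, hi - lo, 1.0)
    normalized = (displacement - lo) / scale
    return np.round(normalized * 255).astype(np.uint8)

# Toy example: every point slides one unit along x per frame.
tracks = np.zeros((4, 2, 2, 3))
tracks[..., 0] = np.arange(4).reshape(4, 1, 1)
colors = trajectories_to_colormap(tracks)
```

Because the result is an ordinary 8-bit RGB video, it could in principle be passed through the same VAE as the input frames, which is the alignment the reviewer refers to. Per-clip min/max normalization discards absolute scale, so a real pipeline would need a fixed or stored range to decode colors back to trajectories.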
- The authors argue that no adapter is needed, but the method requires training a full diffusion UNet on dense 3D trajectories with the same parameter count as the generation UNet. Compared to lightweight adapter methods, this approach may be less efficient overall. A rigorous comparison of training/inference cost against adapter-based baselines is missing, making the “adapter-free” claim feel incomplete.
- The motion-labeled dataset consists only of six controlled camera motions on static Scan
1. The authors present a novel method for zero-shot transfer of motion from a reference video to a newly generated video.
2. In the process, the authors also create a dense 3D video point tracking model, which trades tracking accuracy for inference latency compared to existing approaches.
3. The authors demonstrate the effectiveness of their motion transfer approach both qualitatively and quantitatively.
1. It would be interesting to see how the motion transfer method generalises to more complex motions, for example those generated using ReCamMaster [1], using the same selected features as in the paper.
2. The selected-feature ablation needs some quantitative results as well.
3. One interesting analysis would be to separate foreground object motion from camera motion and see whether they can be transferred independently of each other.
4. It would be very relevant and timely to try out this same metho
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Model Reduction and Neural Networks
