Moaw: Unleashing Motion Awareness for Video Diffusion Models

Tianqi Zhang; Ziyi Wang; Wenzhao Zheng; Weiliang Chen; Yuanhui Huang; Zhengyang Huang; Jie Zhou; Jiwen Lu

arXiv:2601.12761·cs.CV·January 21, 2026


Open Access · 3 Reviews

TL;DR

This paper introduces Moaw, a framework that enhances motion awareness in video diffusion models, enabling effective motion transfer and bridging generative modeling with motion understanding.

Contribution

We propose a supervised training approach that shifts video diffusion models from image-to-video generation to video-to-dense-tracking, facilitating zero-shot motion transfer.

Findings

01 Effective motion transfer without additional adapters

02 Enhanced motion understanding in diffusion models

03 Bridging generative modeling and motion perception

Abstract

Video diffusion models, trained on large-scale datasets, naturally capture correspondences of shared features across frames. Recent works have exploited this property for tasks such as optical flow prediction and tracking in a zero-shot setting. Motivated by these findings, we investigate whether supervised training can more fully harness the tracking capability of video diffusion models. To this end, we propose Moaw, a framework that unleashes motion awareness for video diffusion models and leverages it to facilitate motion transfer. Specifically, we train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking. We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model. Owing to the homogeneity between…
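The feature-identification step mentioned in the abstract (using motion labels to find which features carry the most motion information) can be illustrated with a small sketch. The abstract does not specify the scoring procedure, so the Fisher-style ratio of between-class to within-class variance used here is a stand-in, and the block names, motion labels, and toy feature vectors are all hypothetical.

```python
from statistics import mean


def separability(features, labels):
    """Between-class over within-class variance, summed across dimensions.

    A generic Fisher-style score, used here as an assumed stand-in for
    whatever analysis the paper actually performs.
    """
    dims = len(features[0])
    overall = [mean(f[d] for f in features) for d in range(dims)]
    between = 0.0
    within = 0.0
    for cls in sorted(set(labels)):
        members = [f for f, y in zip(features, labels) if y == cls]
        centroid = [mean(m[d] for m in members) for d in range(dims)]
        # Spread of the class centroid around the global mean.
        between += len(members) * sum(
            (centroid[d] - overall[d]) ** 2 for d in range(dims)
        )
        # Spread of the class members around their own centroid.
        within += sum(
            (m[d] - centroid[d]) ** 2 for m in members for d in range(dims)
        )
    return between / (within + 1e-12)


def pick_most_motion_aware(block_features, labels):
    """Return the block whose features best separate the motion labels."""
    scores = {
        name: separability(feats, labels)
        for name, feats in block_features.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical motion labels for four clips (two camera motions).
labels = ["pan_left", "pan_left", "pan_right", "pan_right"]

# Hypothetical pooled features per block: "early" ignores the motion
# label, while "mid" clusters cleanly by it.
block_features = {
    "early": [[0.0, 1.0], [1.0, 0.0], [0.1, 0.9], [0.9, 0.1]],
    "mid": [[1.0, 0.0], [1.1, 0.1], [-1.0, 0.0], [-0.9, 0.1]],
}

best_block, scores = pick_most_motion_aware(block_features, labels)
```

In this toy setup the "mid" block wins because its features cluster by motion label. In a real pipeline the feature vectors would be pooled activations from each block of the trained model, and the selected block would be the one whose features are injected into the generation model.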

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The idea of training a video diffusion model to predict dense trajectories is interesting.
2. The motion-labeled video dataset designed for analyzing motion-aware features is insightful.
3. The paper is well-written and easy to follow.

Weaknesses

1. Despite the interesting design, the proposed method achieves worse results than the baselines, as shown in Table 1 and Table 2. This limits the applicability of the proposed method.
2. The paper highlights the point tracking task but omits an important benchmark, TAP-Vid [1]. How does its tracking performance compare with non-diffusion methods such as CoTracker3 on TAP-Vid?
3. The paper only evaluates videos of no more than 48 frames.

[1] TAP-Vid: A Benchmark for Tracking Any Point in a Video

Reviewer 02 · Rating 2 · Confidence 4

Strengths

- Using motion-visualization maps (color-mapped trajectories) as input features helps capture coarse motion while reducing appearance dependence.
- Converting trajectories to a colormap and aligning them with the VAE latent space is a clever design choice.
- PCA-based feature analysis is a reasonable attempt at interpretability. The finding that mid-level blocks encode the most discriminative motion information is intuitive and aligns with observations from ControlNet-style architectures.
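To make the color-mapped trajectory idea concrete, here is a minimal sketch of one common way to turn point displacements into colors: direction sets the hue and magnitude sets the saturation, as in familiar optical-flow visualizations. This page does not describe the paper's actual mapping, and the paper works with dense 3D trajectories, so this 2D version, the function names, and the `max_magnitude` parameter are all assumptions made for illustration.

```python
import colorsys
import math


def displacement_to_rgb(dx, dy, max_magnitude):
    """Map a 2D displacement to an 8-bit RGB color.

    Hue encodes direction and saturation encodes magnitude, so a static
    point (zero displacement) comes out white.
    """
    angle = math.atan2(dy, dx)                      # in [-pi, pi]
    hue = ((angle + math.pi) / (2 * math.pi)) % 1.0  # wrap into [0, 1)
    saturation = min(math.hypot(dx, dy) / max_magnitude, 1.0)
    r, g, b = colorsys.hsv_to_rgb(hue, saturation, 1.0)
    return tuple(round(255 * c) for c in (r, g, b))


def trajectory_to_color_frames(track, max_magnitude):
    """Color one tracked point per frame by its offset from frame 0.

    `track` is a list of (x, y) positions over time (hypothetical format).
    """
    x0, y0 = track[0]
    return [
        displacement_to_rgb(x - x0, y - y0, max_magnitude)
        for x, y in track
    ]


# A point that starts at (5, 5) and moves one pixel to the right.
frames = trajectory_to_color_frames([(5, 5), (6, 5)], max_magnitude=1.0)
```

Applying this per pixel yields an image-like map for each frame, which is the kind of input an image VAE can encode; how the paper aligns such maps with the latent space is not covered here.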

Weaknesses

- The authors argue that no adapter is needed, but the method requires training a full diffusion UNet on dense 3D trajectories with the same parameter count as the generation UNet. Compared to lightweight adapter methods, this approach may be less efficient overall. A rigorous comparison of training/inference cost against adapter-based baselines is missing, making the “adapter-free” claim feel incomplete.
- The motion-labeled dataset consists only of six controlled camera motions on static Scan…

Reviewer 03 · Rating 6 · Confidence 3

Strengths

1. The authors present a novel method for zero-shot transfer of motion from a reference video to a newly generated video.
2. In the process, the authors also create a dense 3D point tracking model for video, which trades tracking accuracy for inference latency compared to existing approaches.
3. The authors demonstrate the effectiveness of their motion transfer approach both qualitatively and quantitatively.

Weaknesses

1. It would be interesting to see how the motion transfer method generalises to more complex motions, for example those generated using ReCamMaster [1], using the same selected features as in the paper.
2. The selected-feature ablation needs quantitative results as well.
3. An interesting analysis would be to separate foreground object motion from camera motion and see whether they can be transferred independently of each other.
4. It would be very relevant and timely to try out this same metho…


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Model Reduction and Neural Networks