Motion-Adaptive Temporal Attention for Lightweight Video Generation with Stable Diffusion
Rui Hong, Shuxue Quan

TL;DR
This paper introduces a motion-adaptive temporal attention mechanism for lightweight video generation using frozen Stable Diffusion models, dynamically adjusting attention based on motion to improve temporal consistency and detail.
Contribution
It proposes a novel, parameter-efficient temporal attention method that adapts to motion content, enhancing video quality without extensive additional training.
Findings
Achieves competitive results with only 25.8M additional trainable parameters (2.9% of the base UNet).
Implicit temporal regularization from the denoising objective outperforms explicit losses.
Noise correlation and motion amplitude act as practical controls over generation behavior.
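The noise-correlation control can be illustrated with a small sketch. This summary does not give the paper's exact initialization, so the mixing rule below (a shared component plus a per-frame component, weighted by a hypothetical parameter `rho`) is an assumption chosen because it keeps unit variance per entry and gives any two frames a correlation of `rho`. The function name `correlated_noise` and its arguments are illustrative only.

```python
import math
import random


def correlated_noise(num_frames, dim, rho, seed=0):
    """Return num_frames noise vectors of length dim (lists of floats).

    Assumed scheme, not taken from the paper: each frame is
    sqrt(rho) * shared + sqrt(1 - rho) * independent, so every entry
    has unit variance and any two frames have correlation rho.
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    rng = random.Random(seed)
    # One noise vector shared by all frames.
    shared = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    a, b = math.sqrt(rho), math.sqrt(1.0 - rho)
    # Each frame mixes the shared vector with fresh independent noise.
    return [
        [a * s + b * rng.gauss(0.0, 1.0) for s in shared]
        for _ in range(num_frames)
    ]


def sample_correlation(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / math.sqrt(vx * vy)
```

Under this assumed scheme, `rho = 0` gives independent per-frame noise, `rho = 1` gives identical noise in every frame, and values in between trade motion diversity against frame-to-frame consistency.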
Abstract
We present a motion-adaptive temporal attention mechanism for parameter-efficient video generation built upon frozen Stable Diffusion models. Rather than treating all video content uniformly, our method dynamically adjusts temporal attention receptive fields based on estimated motion content: high-motion sequences attend locally across frames to preserve rapidly changing details, while low-motion sequences attend globally to enforce scene consistency. We inject lightweight temporal attention modules into all UNet transformer blocks via a cascaded strategy: global attention in down-sampling and middle blocks for semantic stabilization, and motion-adaptive attention in up-sampling blocks for fine-grained refinement. Combined with temporally correlated noise initialization and motion-aware gating, the system adds only 25.8M trainable parameters (2.9% of the base UNet) while achieving…
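The local-versus-global behavior described in the abstract can be sketched as a temporal attention mask whose width depends on a motion estimate. The abstract excerpt does not specify the motion estimator or how the receptive field is derived from it, so everything below is an assumption for illustration: the mean absolute frame difference as the motion score, the hard `threshold`, the `local_radius`, and the function names are all hypothetical.

```python
def motion_score(frames):
    """Mean absolute difference between consecutive frames.

    frames is a list of at least two equal-length flat lists of floats.
    This estimator is an assumption, not the paper's.
    """
    diffs = [
        sum(abs(x - y) for x, y in zip(a, b)) / len(a)
        for a, b in zip(frames, frames[1:])
    ]
    return sum(diffs) / len(diffs)


def temporal_radius(score, num_frames, threshold=0.1, local_radius=1):
    """High motion -> narrow (local) window; low motion -> full (global) span."""
    return local_radius if score > threshold else num_frames - 1


def temporal_mask(num_frames, radius):
    """mask[i][j] is True when frame i may attend to frame j."""
    return [
        [abs(i - j) <= radius for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Usage: a static clip gets a global mask, a fast-changing clip a local one.
static_clip = [[0.5, 0.5]] * 4
moving_clip = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
for clip in (static_clip, moving_clip):
    r = temporal_radius(motion_score(clip), len(clip))
    mask = temporal_mask(len(clip), r)
```

A hard threshold is the simplest possible mapping. The abstract also mentions motion-aware gating, which suggests the actual method may blend local and global attention rather than switch between them.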
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Visual Attention and Saliency Detection
