Progressive Autoregressive Video Diffusion Models
Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou

TL;DR
This paper introduces a progressive noise schedule for autoregressive video diffusion models, enabling long-duration video generation (up to 60 seconds) with minimal quality degradation, well beyond the roughly 10-second clips that current models produce.
Contribution
It proposes a new noise level assignment and denoising strategy that improves long video generation quality and coherence in autoregressive video diffusion models.
Findings
Achieved 60-second text-conditioned video generation with high fidelity.
Significantly reduced abrupt scene changes and unnatural motion.
Demonstrated minimal quality degradation over extended video sequences.
Abstract
Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to 1. assign the frames with per-frame, progressively increasing noise levels rather than a single noise level and 2. denoise and shift the frames in small intervals rather than all at once. This allows for…
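The two ideas in the abstract (per-frame, progressively increasing noise levels, and denoising plus shifting in small intervals) can be illustrated with a toy bookkeeping sketch. This is not the authors' implementation: the names `Frame`, `init_window`, `denoise_one_step`, and `generate` are hypothetical, noise levels are modeled as integer step counts, the model call is replaced by a stand-in that only decrements the level, and the window is assumed to start already in the progressive state.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Frame:
    index: int  # position of the frame in the output video
    level: int  # remaining denoising steps; 0 means fully clean


def init_window(window: int) -> deque:
    # Assumption: frame i starts at level i + 1, so noise increases
    # from the front of the window (almost clean) to the back (pure noise).
    return deque(Frame(index=i, level=i + 1) for i in range(window))


def denoise_one_step(frames: deque) -> None:
    # Stand-in for one call to a video diffusion model: every frame in
    # the window moves one step closer to clean. No pixels are involved.
    for frame in frames:
        frame.level -= 1


def generate(num_frames: int, window: int) -> list:
    frames = init_window(window)
    next_index = window
    finished = []
    while len(finished) < num_frames:
        denoise_one_step(frames)
        # After one step the front frame is clean: emit it and shift.
        clean = frames.popleft()
        assert clean.level == 0
        finished.append(clean.index)
        # Refill the back of the window with a new pure-noise frame,
        # which restores the progressive levels 1..window.
        frames.append(Frame(index=next_index, level=window))
        next_index += 1
    return finished
```

In this sketch, `generate(10, 4)` emits frame indices 0 through 9 in order, one clean frame per denoising step. The window always holds levels 1 through 4, so there is no point at which a whole clip is replaced at once.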
Peer Reviews
Decision: ICLR 2025 Conference Withdrawn Submission
1. PA-VDM extends the capabilities of existing video diffusion models to generate longer videos, up to 1 minute in length (1440 frames at 24 FPS), without compromising quality. This is achieved by assigning latent frames with progressively increasing noise levels, allowing for autoregressive generation without degradation. 2. PA-VDM maintains temporal consistency throughout the generated video, ensuring smooth transitions and realistic motion dynamics. This is in contrast to other methods that s
An important baseline is missing. The increasing-noise diffusion scheduler of PA-VDM is actually a multi-task training process, i.e., the model is trained on mixed types of conditions, and therefore the model can perform both video generation and video extension. Towards this goal, there is a more straightforward fine-tuning strategy than the increasing-noise diffusion scheduler, i.e., directly training the model on both text-to-video and video-to-video data. For example, if the maximum video
1. This paper proposes an autoregressive video diffusion model that denoises video frames in a progressive manner, allowing for both high-quality video content extension and smooth motion generation. 2. The method can be easily implemented by changing the noise scheduling of pre-trained video diffusion models. 3. The additional computational cost at inference time is not large.
1. Some baseline models should be added, for example VideoCrafter2, T2V-Turbo, and Open-Sora. 2. In Figure 2, have the authors tried different noise-increasing schedules, and how do they affect model performance? 3. Is there any model complexity analysis, for example #params and FLOPs?
The model shows promising results on generating long-term high resolution videos.
While this paper introduces a progressive noise schedule for window-based autoregressive video generation, there are several critical concerns that limit its contribution as a research paper. 1. **Lack of Novelty** (Existing previous work): The core idea of a progressive noise schedule across frames was previously proposed in the ICML 2024 paper [Rolling Diffusion Model](https://arxiv.org/pdf/2402.09470), which also employs this technique to enable autoregressive video generation. Although it’s
- The paper is generally well-written and easy to follow. - Qualitatively, the result is better than other naive baselines.
- Per-frame noise scheduling has recently been introduced by many works, to name a few [1, 2, 3]. In particular, [1, 2] also deal with the training scheme of video diffusion models for generating long videos. As it stands, there seems to be no technical difference between these works and the proposed method. The only difference seems to be that this paper fine-tunes existing video diffusion models, which are not designed with per-frame scheduling---but for me, this is quite a marginal cont
Taxonomy
Topics: Image and Signal Denoising Methods
Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Diffusion
