Progressive Autoregressive Video Diffusion Models

Desai Xie; Zhan Xu; Yicong Hong; Hao Tan; Difan Liu; Feng Liu; Arie Kaufman; Yang Zhou

arXiv:2410.08151 · cs.CV · May 20, 2025

TL;DR

This paper introduces a novel progressive noise scheduling approach for autoregressive video diffusion models, enabling high-quality, long-duration video generation with minimal quality loss, surpassing previous short clip limitations.

Contribution

It proposes a new noise level assignment and denoising strategy that improves long video generation quality and coherence in autoregressive video diffusion models.

Findings

1. Achieved 60-second text-conditioned video generation with high fidelity.
2. Substantially reduced abrupt scene changes and unnatural motion.
3. Demonstrated minimal quality degradation over extended video sequences.

Abstract

Current frontier video diffusion models have demonstrated remarkable results at generating high-quality videos. However, they can only generate short video clips, normally around 10 seconds or 240 frames, due to computation limitations during training. Existing methods naively achieve autoregressive long video generation by directly placing the ending of the previous clip at the front of the attention window as conditioning, which leads to abrupt scene changes, unnatural motion, and error accumulation. In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models. Our key idea is to (1) assign the frames per-frame, progressively increasing noise levels rather than a single noise level, and (2) denoise and shift the frames in small intervals rather than all at once. This allows for…
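The two-part idea in the abstract can be sketched as a toy bookkeeping loop. This is only an illustration, not the authors' implementation: the function names `initial_levels` and `progressive_rollout`, the linear schedule, the one-frame shift per step, and the pre-filled starting window are all assumptions, and the model call is replaced by a simple decrement of an integer noise level.

```python
from collections import deque


def initial_levels(window: int) -> list[int]:
    """Integer noise levels: frame i starts at level i + 1 out of `window`.

    Assumption: a linear schedule, so level / window is the noise level.
    """
    return [i + 1 for i in range(window)]


def progressive_rollout(num_frames: int, window: int) -> list[int]:
    """Return frame ids in the order they become fully denoised.

    Each window entry is [frame_id, level]; level 0 means clean and
    level == window means pure noise. Hypothetical simplification:
    the window starts already filled with progressively noisier frames.
    """
    frames = deque([i, lvl] for i, lvl in enumerate(initial_levels(window)))
    next_id = window
    emitted = []
    while len(emitted) < num_frames:
        # Denoise every frame in the window by one small interval.
        for frame in frames:
            frame[1] -= 1  # a real denoising model call would go here
        # The front frame is now clean, so shift it out of the window.
        frame_id, level = frames.popleft()
        assert level == 0
        emitted.append(frame_id)
        # Append a fresh frame at the maximum noise level.
        frames.append([next_id, window])
        next_id += 1
    return emitted
```

For example, `progressive_rollout(5, 3)` returns `[0, 1, 2, 3, 4]`: after every step the window again holds levels 1 through 3, so generation can continue for any number of frames without resetting the window.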

Peer Reviews

Decision · ICLR 2025 Conference Withdrawn Submission

Reviewer 01 · Rating 5 · Confidence 4

Strengths

1. PA-VDM extends the capabilities of existing video diffusion models to generate longer videos, up to 1 minute in length (1440 frames at 24 FPS), without compromising quality. This is achieved by assigning latent frames with progressively increasing noise levels, allowing for autoregressive generation without degradation.
2. PA-VDM maintains temporal consistency throughout the generated video, ensuring smooth transitions and realistic motion dynamics. This is in contrast to other methods that s

Weaknesses

An important baseline is missed. The increasing-noise diffusion scheduler of PA-VDM is actually a multi-task training process, i.e., the model is trained on mixed types of conditions, and therefore the model can perform both video generation and video extending. Towards this goal, there is a more straightforward fine-tuning strategy than the increasing-noise diffusion scheduler, i.e., directly training the model on both text-to-video and video-to-video data. For example, if the maximum video

Reviewer 02 · Rating 5 · Confidence 4

Strengths

1. This paper proposes an autoregressive video diffusion model that denoises video frames in a progressive manner, allowing for both high-quality video content extension and smooth motion generation.
2. The method can be easily implemented by changing the noise scheduling of pre-trained video diffusion models.
3. The additional computational cost at inference time is not large.

Weaknesses

1. Some baseline models should be added, for example VideoCrafter2, T2V-Turbo, and Open-Sora.
2. In Figure 2, have the authors tried different noise-increasing schedules? What is their influence on model performance?
3. Is there any model complexity analysis, for example #params and FLOPs?

Reviewer 03 · Rating 3 · Confidence 4

Strengths

The model shows promising results on generating long-term high resolution videos.

Weaknesses

While this paper introduces a progressive noise schedule for window-based autoregressive video generation, there are several critical concerns that limit its contribution as a research paper.
1. **Lack of Novelty** (Existing previous work): The core idea of a progressive noise schedule across frames was previously proposed in the ICML 2024 paper [Rolling Diffusion Model](https://arxiv.org/pdf/2402.09470), which also employs this technique to enable autoregressive video generation. Although it’s

Reviewer 04 · Rating 3 · Confidence 5

Strengths

- The paper is generally well-written and easy to follow.
- Qualitatively, the result is better than other naive baselines.

Weaknesses

- Per-frame noise scheduling has recently been introduced by many works, to name a few [1, 2, 3]. In particular, [1, 2] also deal with the training scheme of video diffusion models to generate long videos. As it stands, there seems to be no technical difference between these works and the proposed method. The only difference seems to be that this paper fine-tunes existing video diffusion models, which are not designed with per-frame scheduling---but for me, this is quite a marginal cont

Code & Models

Repositories

desaixie/pa_vdm · PyTorch · Official

Taxonomy

Topics: Image and Signal Denoising Methods

Methods: Softmax · Attention Is All You Need · Contrastive Language-Image Pre-training · Diffusion