Adaptive Begin-of-Video Tokens for Autoregressive Video Diffusion Models

Tianle Cheng; Zeyan Zhang; Kaifeng Gao; Jun Xiao

arXiv:2511.12099·cs.CV·November 18, 2025

Open Access

TL;DR

This paper introduces adaptive begin-of-video tokens for autoregressive video diffusion models, enhancing long video generation by improving global consistency and local dynamics through learnable embeddings and a refinement strategy.

Contribution

The paper proposes a novel adaptive BOV token mechanism and a refinement strategy for stream denoising, advancing long video generation quality and consistency in diffusion models.

Findings

01 Achieves better global consistency in long video synthesis.

02 Improves local motion dynamics and image quality.

03 Demonstrates superior performance on multiple metrics.

Abstract

Recent advances in diffusion-based video generation have produced impressive, high-fidelity short videos. To extend this success to coherent long videos, most video diffusion models (VDMs) generate videos autoregressively, i.e., they generate subsequent frames conditioned on previous ones. There are two primary paradigms: chunk-based extension and stream denoising. The former directly concatenates previous clean frames as conditioning and suffers from denoising latency and error accumulation. The latter maintains a denoising sequence with monotonically increasing noise levels. In each denoising iteration, one clean frame is produced while a new pure-noise frame is simultaneously appended, enabling live-stream sampling. However, it struggles with fragile consistency and poor motion dynamics. In this paper, we propose Adaptive Begin-of-Video Tokens…
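The stream-denoising loop described in the abstract can be illustrated with a toy sketch. This is not the paper's method and does not include the proposed begin-of-video tokens (the abstract is truncated before they are described); it only shows the window mechanics: noise levels increase from head to tail, each iteration emits one clean frame, and a fresh pure-noise frame is appended. The names `WINDOW`, `FRAME_SIZE`, `toy_denoise`, and `stream_denoise` are hypothetical, the "denoiser" is a stand-in that shrinks noise toward an all-zero clean signal, and filling the initial window with pure noise is a simplification.

```python
import random
from collections import deque

WINDOW = 4       # frames held in the denoising window (assumed value)
FRAME_SIZE = 3   # toy "pixels" per frame (assumed value)


def pure_noise(rng):
    """Sample a frame of standard Gaussian noise."""
    return [rng.gauss(0.0, 1.0) for _ in range(FRAME_SIZE)]


def toy_denoise(frame, level):
    """Hypothetical stand-in for a learned denoiser.

    Moves a frame from noise level `level` to `level - 1`. The toy clean
    signal is all zeros, so the step simply shrinks the remaining noise.
    """
    scale = (level - 1) / level
    return [x * scale for x in frame]


def stream_denoise(num_frames, seed=0):
    """Emit `num_frames` clean frames, one per denoising iteration."""
    rng = random.Random(seed)
    # Entries are (frame, level); levels 1..WINDOW increase from head to tail.
    window = deque((pure_noise(rng), level) for level in range(1, WINDOW + 1))
    outputs, level_trace = [], []
    while len(outputs) < num_frames:
        level_trace.append([level for _, level in window])
        # One iteration lowers every frame's noise level by one step.
        window = deque((toy_denoise(f, lvl), lvl - 1) for f, lvl in window)
        clean, level = window.popleft()  # head frame has reached level 0
        assert level == 0
        outputs.append(clean)
        # Append a fresh pure-noise frame at the highest noise level.
        window.append((pure_noise(rng), WINDOW))
    return outputs, level_trace


if __name__ == "__main__":
    frames, trace = stream_denoise(5)
    print(len(frames), trace[0])
```

Because the window is refilled as soon as the head frame leaves, the level pattern is identical at the start of every iteration, which is what allows one frame to be streamed out per step instead of waiting for a whole chunk to finish denoising.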


Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Image Processing Techniques · Advanced Vision and Imaging