BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation

Zeyu Zhang; Shuning Chang; Yuanyu He; Yizeng Han; Jiasheng Tang; Fan Wang; Bohan Zhuang

arXiv:2511.22973·cs.CV·December 1, 2025

TL;DR

BlockVid is a block diffusion framework that combines semantic-aware caching with specialized training strategies to generate high-quality, coherent minute-long videos. It targets two problems: long-horizon error accumulation and the lack of coherence-aware evaluation for long videos.

Contribution

The paper proposes BlockVid, a block diffusion method with a semantic-aware sparse KV cache and chunk-wise noise scheduling, along with LV-Bench, a new benchmark for long-video coherence evaluation.

Findings

01 Outperforms existing methods in quality and coherence metrics.

02 Achieves 22.2% improvement on VDE Subject.

03 Achieves 19.4% improvement on VDE Clarity.

Abstract

Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We…

Code & Models

Datasets

heyuanyu/LVG-Bench
dataset · 1.5k downloads

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging