BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang

TL;DR
BlockVid introduces a novel block diffusion framework with semantic-aware caching and specialized training strategies to generate high-quality, coherent minute-long videos, addressing long-horizon errors and coherence evaluation.
Contribution
The paper proposes BlockVid, a new block diffusion method with semantic-aware sparse KV cache and chunk-wise noise scheduling, plus a new benchmark LV-Bench for long-video coherence evaluation.
Findings
Outperforms existing methods in quality and coherence metrics.
Achieves 22.2% improvement on VDE Subject.
Achieves 19.4% improvement on VDE Clarity.
Abstract
Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it yet faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Advanced Vision and Imaging
