Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention
Dvir Samuel, Issar Tzachor, Matan Levy, Micahel Green, Gal Chechik, Rami Ben-Ari

TL;DR
This paper introduces a unified, training-free attention framework for autoregressive video diffusion models that significantly reduces computation and memory bottlenecks, enabling faster and more stable long-form video generation.
Contribution
It proposes TempCache, AnnCA, and AnnSA modules that together optimize attention mechanisms, improving speed and memory efficiency without retraining or sacrificing quality.
Findings
Achieves up to 10x speedup in video generation
Maintains near-identical visual quality
Ensures stable GPU memory usage over long sequences
Abstract
Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion:…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenerative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Human Pose and Action Recognition
