Fast Autoregressive Video Diffusion and World Models with Temporal Cache Compression and Sparse Attention

Dvir Samuel; Issar Tzachor; Matan Levy; Michael Green; Gal Chechik; Rami Ben-Ari

arXiv:2602.01801·cs.CV·February 3, 2026

Open Access

TL;DR

This paper introduces a unified, training-free attention framework for autoregressive video diffusion models that significantly reduces computation and memory bottlenecks, enabling faster and more stable long-form video generation.

Contribution

The paper proposes three modules, TempCache, AnnCA, and AnnSA, that together reduce the cost of attention, improving speed and memory efficiency without retraining or sacrificing quality.

Findings

01 Achieves up to 10x speedup in video generation

02 Maintains near-identical visual quality

03 Ensures stable GPU memory usage over long sequences

Abstract

Autoregressive video diffusion models enable streaming generation, opening the door to long-form synthesis, video world models, and interactive neural game engines. However, their core attention layers become a major bottleneck at inference time: as generation progresses, the KV cache grows, causing both increasing latency and escalating GPU memory, which in turn restricts usable temporal context and harms long-range consistency. In this work, we study redundancy in autoregressive video diffusion and identify three persistent sources: near-duplicate cached keys across frames, slowly evolving (largely semantic) queries/keys that make many attention computations redundant, and cross-attention over long prompts where only a small subset of tokens matters per frame. Building on these observations, we propose a unified, training-free attention framework for autoregressive diffusion:…
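To make the first redundancy source (near-duplicate cached keys across frames) concrete, here is a minimal sketch of a KV cache that skips new entries whose keys nearly match ones already stored. It is not the paper's method, whose details are elided above: the function name `dedup_kv_cache`, the cosine-similarity rule, and the `threshold` value are all assumptions chosen for illustration, and a practical scheme would likely merge rather than discard the matching values.

```python
import numpy as np

def dedup_kv_cache(keys, values, new_keys, new_values, threshold=0.95):
    """Append a frame's keys/values to the cache, skipping near-duplicates.

    keys, values:         (n, d) arrays already cached (n may be 0).
    new_keys, new_values: (m, d) arrays from the newest frame.
    Hypothetical rule: a new key is skipped when its cosine similarity
    to any cached key is >= threshold.
    """
    if keys.shape[0] == 0:
        # Nothing cached yet, so every new entry is kept.
        return new_keys.copy(), new_values.copy()

    def normalize(x):
        # Row-wise L2 normalization; epsilon guards against zero vectors.
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)

    # Cosine similarity between each new key and each cached key: (m, n).
    sim = normalize(new_keys) @ normalize(keys).T
    # Keep only new keys that are not close to any cached key.
    keep = sim.max(axis=1) < threshold

    return (np.concatenate([keys, new_keys[keep]], axis=0),
            np.concatenate([values, new_values[keep]], axis=0))

# Usage: the first new key duplicates a cached direction, the second is novel.
cached_k = np.eye(4)[:2]
cached_v = np.ones((2, 4))
frame_k = np.array([[2.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]])
frame_v = np.zeros((2, 4))
cached_k, cached_v = dedup_kv_cache(cached_k, cached_v, frame_k, frame_v)
print(cached_k.shape)
```

Under this rule the cache grows only when a frame contributes keys that differ from what is stored, which is the intuition behind keeping memory stable as generation proceeds.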

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Visual Attention and Saliency Detection · Human Pose and Action Recognition