Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion
Kunyang Li, Mubarak Shah, Yuzhang Shang

TL;DR
This paper introduces ARL2, a hybrid attention module for autoregressive video diffusion that replaces quadratic cross-frame attention with a fixed-size recurrent state, enabling linear scaling, lower memory use, and improved temporal consistency.
Contribution
The paper proposes a hybrid attention module that combines intra-frame softmax attention with inter-frame recurrent linear attention, and uses it to convert pretrained AR video diffusion models into more scalable architectures.
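To make the two-branch idea concrete, here is a minimal NumPy sketch of one per-frame step. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `elu(x) + 1` feature map, the normalizer vector, and the choice to simply add the two branch outputs are all hypothetical, since this summary does not specify how ARL2 parameterizes or merges the branches.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a positive feature map often used in linear attention (assumed here).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention_step(q, k, v, state, norm):
    """Process one frame.

    q, k: (n, d) queries/keys for the n tokens of the current frame.
    v:    (n, d_v) values for the current frame.
    state: (d, d_v) running sum of phi(k)^T v over all previous frames.
    norm:  (d,) running sum of phi(k) over all previous frames.
    """
    d = q.shape[-1]

    # Intra-frame branch: ordinary softmax attention among the frame's own tokens.
    local = softmax(q @ k.T / np.sqrt(d)) @ v

    # Inter-frame branch: read past frames from the fixed-size state.
    phi_q = feature_map(q)
    denom = np.maximum(phi_q @ norm, 1e-6)   # guards the empty-memory first frame
    memory = (phi_q @ state) / denom[:, None]

    # Fold the current frame into the state; its size does not grow with time.
    phi_k = feature_map(k)
    new_state = state + phi_k.T @ v
    new_norm = norm + phi_k.sum(axis=0)

    # Assumption: branches are merged by plain addition.
    return local + memory, new_state, new_norm

# Usage: stream a few random "frames" through the module.
rng = np.random.default_rng(0)
n, d, d_v = 16, 8, 8
state, norm = np.zeros((d, d_v)), np.zeros(d)
for _ in range(5):
    q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
    out, state, norm = hybrid_attention_step(q, k, v, state, norm)
```

However many frames are streamed, `state` and `norm` keep the shapes `(d, d_v)` and `(d,)`, which is what removes the growing key-value cache; the per-frame cost is quadratic only in the number of tokens within a single frame.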
Findings
Achieves up to 2.26x speedup and 54% memory reduction.
Maintains comparable video quality with improved temporal consistency.
First to convert pretrained AR video diffusion models into hybrid linear attention architectures.
Abstract
Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to compute that grows quadratically with sequence length and to memory usage that grows linearly due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame…
