Attend Locally, Remember Linearly: Linear Attention as Cross-Frame Memory for Autoregressive Video Diffusion

Kunyang Li; Mubarak Shah; Yuzhang Shang

arXiv:2605.16579·cs.CV·May 22, 2026

TL;DR

This paper introduces ARL2 (Attend Locally, Remember Linearly), a hybrid attention module for autoregressive video diffusion that replaces quadratic cross-frame attention with a fixed-size recurrent state, giving linear scaling, lower memory use, and improved temporal consistency.

Contribution

The paper proposes a hybrid attention module that combines intra-frame softmax attention with inter-frame recurrent linear attention, and uses it to convert pretrained AR video diffusion models into more scalable architectures.
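As a rough illustration of the two-branch idea, the NumPy sketch below processes one frame at a time: softmax attention inside the frame, plus a read from a fixed-size state that summarizes earlier frames. The feature map (elu + 1), the running normalizer, the plain sum of the two branches, and every function name here are assumptions made for illustration; the paper's actual formulation may differ.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1. An assumed choice, not taken from the paper.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention_step(q, k, v, state, norm):
    """Process one frame of N tokens with feature dimension d.

    q, k, v: arrays of shape (N, d) for the current frame.
    state:   (d, d) running sum of phi(k)^T v over past frames.
    norm:    (d,)   running sum of phi(k) over past frames.
    """
    d = q.shape[-1]
    # Intra-frame branch: ordinary softmax attention within the frame.
    local = softmax(q @ k.T / np.sqrt(d)) @ v
    # Inter-frame branch: read from the fixed-size summary of past frames.
    phi_q = feature_map(q)
    denom = np.maximum(phi_q @ norm, 1e-6)
    memory = (phi_q @ state) / denom[:, None]
    # Combining the branches by a plain sum is an assumption for illustration.
    out = local + memory
    # Fold the current frame into the state; its size does not change.
    phi_k = feature_map(k)
    new_state = state + phi_k.T @ v
    new_norm = norm + phi_k.sum(axis=0)
    return out, new_state, new_norm

def run_frames(num_frames=3, tokens=4, dim=8, seed=0):
    # Hypothetical driver: random inputs, zero-initialized state.
    rng = np.random.default_rng(seed)
    state, norm = np.zeros((dim, dim)), np.zeros(dim)
    outputs = []
    for _ in range(num_frames):
        q, k, v = rng.normal(size=(3, tokens, dim))
        out, state, norm = hybrid_attention_step(q, k, v, state, norm)
        outputs.append(out)
    return outputs, state, norm
```

In this sketch the per-frame cost does not depend on how many frames came before, because the only cross-frame quantity carried forward is the `(d, d)` state and the `(d,)` normalizer.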

Findings

01. Achieves up to 2.26x speedup and 54% memory reduction.

02. Maintains comparable video quality with improved temporal consistency.

03. First to convert pretrained AR video diffusion models into hybrid linear attention architectures.

Abstract

Autoregressive (AR) video diffusion is a powerful paradigm for streaming and interactive video generation. However, its reliance on softmax self-attention leads to quadratic compute complexity in sequence length and memory usage due to key-value caching, which limits its scalability to long video horizons. Existing remedies (e.g., sparse attention and KV-cache compression) reduce per-step cost but still rely on a linearly growing cache or irreversibly discard past context, and thus fail to address linear memory growth and streaming context management. To address this scalability bottleneck, we propose ARL2 (Attend Locally, Remember Linearly), a hybrid attention module that replaces quadratic cross-frame attention with a fixed-size recurrent state. We decompose self-attention into two branches: an intra-frame softmax branch for spatial detail and local dependencies, and an inter-frame…
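The memory argument in the abstract can be made concrete with a small back-of-the-envelope calculation. The sketch below counts stored values per attention head for a growing key-value cache versus a fixed-size recurrent state. The state layout (a `dim x dim` matrix plus a `dim`-sized normalizer) is the generic linear-attention form and is assumed here, not taken from the paper; the function names are hypothetical.

```python
def kv_cache_floats(num_frames, tokens_per_frame, dim):
    # Keys and values for every past token: grows linearly with the frame count.
    return 2 * num_frames * tokens_per_frame * dim

def recurrent_state_floats(dim):
    # A (dim x dim) key-value summary plus a (dim,) normalizer: constant size.
    return dim * dim + dim

if __name__ == "__main__":
    # Illustrative sizes only; these are not figures from the paper.
    for frames in (10, 100, 1000):
        cache = kv_cache_floats(frames, tokens_per_frame=100, dim=64)
        fixed = recurrent_state_floats(dim=64)
        print(frames, cache, fixed)
```

Under these assumptions the cache size scales with the number of generated frames, while the recurrent state stays the same size no matter how long the video runs.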
