Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention

Taiye Chen; Zihan Ding; Anjian Li; Christina Zhang; Zeqi Xiao; Yisen Wang; Chi Jin

arXiv:2511.12940·cs.CV·November 18, 2025

Open Access

TL;DR

This paper introduces Recurrent Autoregressive Diffusion (RAD), a novel framework combining RNNs with diffusion models to improve long-term video generation by effectively managing historical information within fixed memory constraints.

Contribution

The paper proposes RAD, a new autoregressive diffusion framework that uses LSTM-based memory updates for consistent long-term video generation, addressing limitations of existing diffusion-RNN approaches.

Findings

1. RAD outperforms existing models on Memory Maze and Minecraft datasets.

2. LSTM-based diffusion models achieve comparable performance to state-of-the-art RNN blocks.

3. RAD demonstrates superior long video generation capabilities with efficient memory management.

Abstract

Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to a training-inference gap or the lack of overlap across…
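The idea of pairing a fixed-size recurrent memory with attention over a local window can be illustrated with a toy sketch. The code below is not the RAD architecture and contains no diffusion or denoising step; it only shows, under simplifying assumptions, how a standard LSTM cell can carry a constant-size summary of past frames while attention looks at the most recent ones. All names (`run`, `lstm_step`, `local_attention`, the projection `P`, the window size, and the dimensions) are hypothetical, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One standard LSTM cell update; x: (d_in,), h and c: (d_mem,)."""
    z = W @ np.concatenate([x, h]) + b           # (4 * d_mem,)
    i, f, o, g = np.split(z, 4)                  # input, forget, output gates; candidate
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(c_new)
    return h_new, c_new

def local_attention(query, keys):
    """Softmax attention of one query over a few keys (keys double as values)."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ keys

def run(frames, window=4, d_mem=8, seed=0):
    """frames: (T, d) array of toy frame features (an assumption, not video latents)."""
    rng = np.random.default_rng(seed)
    d = frames.shape[1]
    W = rng.normal(scale=0.1, size=(4 * d_mem, d + d_mem))  # random, untrained weights
    b = np.zeros(4 * d_mem)
    P = rng.normal(scale=0.1, size=(d, d_mem))   # hypothetical map: memory -> frame space
    h, c = np.zeros(d_mem), np.zeros(d_mem)
    outputs = []
    for t, x in enumerate(frames):
        recent = frames[max(0, t - window + 1): t + 1]   # local window, including frame t
        memory_token = P @ h                             # summary of frames before t
        context = np.vstack([recent, memory_token[None, :]])
        outputs.append(local_attention(x, context))
        h, c = lstm_step(x, h, c, W, b)                  # fold frame t into the memory
    return np.stack(outputs), h, c
```

In this sketch the memory state `(h, c)` has the same size whether the sequence has 20 frames or 2,000, which is the sense of a "fixed memory budget" used here; the attention context never grows beyond `window + 1` rows.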

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis