Recurrent Autoregressive Diffusion: Global Memory Meets Local Attention
Taiye Chen, Zihan Ding, Anjian Li, Christina Zhang, Zeqi Xiao, Yisen Wang, Chi Jin

TL;DR
This paper introduces Recurrent Autoregressive Diffusion (RAD), a novel framework combining RNNs with diffusion models to improve long-term video generation by effectively managing historical information within fixed memory constraints.
Contribution
The paper proposes RAD, a new autoregressive diffusion framework that uses LSTM-based memory updates for consistent long-term video generation, addressing limitations of existing diffusion-RNN approaches.
Findings
RAD outperforms existing models on the Memory Maze and Minecraft datasets.
A diffusion model that combines LSTM with attention performs comparably to state-of-the-art RNN blocks such as TTT and Mamba2.
RAD generates long videos more consistently than existing models while keeping historical information within a fixed memory budget.
Abstract
Recent advancements in video generation have demonstrated the potential of using video diffusion models as world models, with autoregressive generation of infinitely long videos through masked conditioning. However, such models, usually with local full attention, lack effective memory compression and retrieval for long-term generation beyond the window size, leading to issues of forgetting and spatiotemporal inconsistencies. To enhance the retention of historical information within a fixed memory budget, we introduce a recurrent neural network (RNN) into the diffusion transformer framework. Specifically, a diffusion model incorporating LSTM with attention achieves comparable performance to state-of-the-art RNN blocks, such as TTT and Mamba2. Moreover, existing diffusion-RNN approaches often suffer from performance degradation due to training-inference gap or the lack of overlap across…
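The idea described in the abstract, a recurrent state that carries global history alongside full attention over a local window, can be illustrated with a minimal sketch. This is not the paper's architecture: the names `TinyLSTMCell`, `local_attention`, and `rollout` are hypothetical, the weights are untrained random placeholders, and frames are plain feature vectors rather than video latents. It only shows that the per-step state (window plus LSTM state) stays bounded however long the sequence runs.

```python
import math
import random
from collections import deque


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


class TinyLSTMCell:
    """Standard LSTM cell. Weights are random placeholders, not trained."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        n_in = input_size + hidden_size
        # One weight matrix per gate: input (i), forget (f), output (o), candidate (g).
        self.W = {
            gate: [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(hidden_size)]
            for gate in "ifog"
        }

    def step(self, x, h, c):
        z = x + h  # concatenate input and previous hidden state
        i = [sigmoid(a) for a in matvec(self.W["i"], z)]
        f = [sigmoid(a) for a in matvec(self.W["f"], z)]
        o = [sigmoid(a) for a in matvec(self.W["o"], z)]
        g = [math.tanh(a) for a in matvec(self.W["g"], z)]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def local_attention(query, window):
    """Scaled dot-product softmax attention of one query over the window."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale for frame in window]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w / total * frame[d] for w, frame in zip(weights, window))
            for d in range(len(query))]


def rollout(frames, window_size=4, hidden_size=8):
    """Process frames one at a time with a bounded window and an LSTM state."""
    dim = len(frames[0])
    cell = TinyLSTMCell(dim, hidden_size)
    h = [0.0] * hidden_size
    c = [0.0] * hidden_size
    window = deque(maxlen=window_size)  # frames older than the window are dropped
    outputs, state_sizes = [], []
    for frame in frames:
        window.append(frame)
        local = local_attention(frame, window)  # local context within the window
        h, c = cell.step(local, h, c)           # global memory update
        outputs.append(local + h)               # local context joined with memory
        state_sizes.append(len(window) * dim + len(h) + len(c))
    return outputs, state_sizes


# Usage: 50 synthetic 3-dimensional "frames".
frames = [[math.sin(0.1 * t + d) for d in range(3)] for t in range(50)]
outputs, state_sizes = rollout(frames)
```

Once the window fills, `state_sizes` stops growing: anything older than `window_size` frames survives only through the LSTM state `(h, c)`, which is the fixed-budget trade-off the abstract refers to.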
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Domain Adaptation and Few-Shot Learning · Face recognition and analysis
