Memorize-and-Generate: Towards Long-Term Consistency in Real-Time Video Generation

Tianrui Zhu; Shiyi Zhang; Zhirui Sun; Jingqi Tian; Yansong Tang

arXiv:2512.18741 · cs.CV · December 24, 2025

TL;DR

This paper introduces Memorize-and-Generate (MAG), a novel framework for long-term consistent real-time video generation that effectively balances memory efficiency and scene coherence by separating memory compression from frame synthesis.

Contribution

MAG decouples memory compression and frame generation, enabling long-term scene consistency without excessive memory use, and introduces MAG-Bench for rigorous evaluation of memory retention.

Findings

1. MAG outperforms existing models in scene consistency over long videos.
2. MAG maintains competitive performance on standard benchmarks.
3. MAG-Bench effectively evaluates historical memory retention.

Abstract

Frame-level autoregressive (frame-AR) models have achieved significant progress, enabling real-time video generation comparable to bidirectional diffusion models and serving as a foundation for interactive world models and game engines. However, current approaches in long video generation typically rely on window attention, which naively discards historical context outside the window, leading to catastrophic forgetting and scene inconsistency; conversely, retaining full history incurs prohibitive memory costs. To address this trade-off, we propose Memorize-and-Generate (MAG), a framework that decouples memory compression and frame generation into distinct tasks. Specifically, we train a memory model to compress historical information into a compact KV cache, and a separate generator model to synthesize subsequent frames utilizing this compressed representation. Furthermore, we introduce…
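
The memory trade-off described in the abstract can be made concrete with some back-of-the-envelope arithmetic. The sketch below is only an illustration, not the authors' implementation: every number in `CacheConfig`, the window size, and the fixed compression `ratio` are hypothetical assumptions chosen for readability, and the abstract does not state how MAG's memory model actually compresses the KV cache.

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    # All values are illustrative assumptions, not figures from the paper.
    tokens_per_frame: int = 1000
    num_layers: int = 30
    hidden_dim: int = 1536
    bytes_per_value: int = 2  # e.g. 16-bit floats


def kv_cache_bytes(num_frames: int, cfg: CacheConfig) -> int:
    """Bytes needed to store keys and values for `num_frames` cached frames."""
    # Factor of 2 accounts for storing both a key and a value per token per layer.
    per_token = 2 * cfg.num_layers * cfg.hidden_dim * cfg.bytes_per_value
    return num_frames * cfg.tokens_per_frame * per_token


def cached_frames(history: int, strategy: str, window: int = 16, ratio: int = 8) -> int:
    """Frame-equivalents of KV state kept after `history` generated frames.

    `window` and `ratio` are hypothetical parameters for this sketch.
    """
    if strategy == "full":
        return history  # keep everything: grows without bound
    if strategy == "window":
        return min(history, window)  # bounded, but older frames are discarded
    if strategy == "compressed":
        return -(-history // ratio)  # ceiling division: history kept at 1/ratio size
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    cfg = CacheConfig()
    for history in (16, 256, 4096):
        for strategy in ("full", "window", "compressed"):
            frames = cached_frames(history, strategy)
            gib = kv_cache_bytes(frames, cfg) / 2**30
            print(f"history={history:5d} {strategy:10s} frames_kept={frames:5d} cache={gib:8.2f} GiB")
```

Under these assumed numbers, the full-history cache grows linearly with video length, the windowed cache stays flat but forgets everything outside the window, and a compressed cache grows more slowly while still covering the whole history. That third regime is the one the abstract's compact KV cache is aiming for.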

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Pose and Action Recognition · Human Motion and Animation