TL;DR
Echo-Forcing introduces a scene memory framework for interactive long video generation, enabling smooth scene transitions, long-term recall, and prompt responsiveness without additional training.
Contribution
It proposes a training-free, hierarchical scene memory framework, built on three core mechanisms, for interactive long video generation.
Findings
Achieves superior performance on VBench-Long in both long-video and interactive scenarios.
Supports smooth transitions, hard cuts, and long-range scene recall within a bounded cache.
Demonstrates effectiveness through extensive evaluations.
Abstract
Autoregressive video diffusion models enable open-ended generation through local attention and KV caching. However, existing training-free long-video optimization methods mainly focus on stable extension under a single prompt, which makes it difficult for them to handle interactive scenarios involving prompt switching, forgetting of old scenes, and recall of historical scenes. We identify the core bottleneck as the functional entanglement of historical KV states: stable anchors and recent dynamics are handled by the same cache policy, leading to contamination from outdated backgrounds, delayed response to new prompts, and loss of long-range memory. To address this issue, we propose Echo-Forcing, a training-free scene memory framework designed for interactive long video generation, with three core mechanisms: (1) Hierarchical Temporal Memory, which decouples stable anchors, compressed history, and…
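To make the "one cache policy for everything" bottleneck concrete, the following is a minimal illustrative sketch of a bounded cache that applies a different policy to each tier. It is not the Echo-Forcing implementation. The class name `TieredCache`, its parameters, the mean-pooling compression, and the use of a recent window as the third tier are all assumptions made for illustration; the abstract is cut off before it names the third tier, and the recent window here is borrowed from the "recent dynamics" it mentions earlier. Plain lists of floats stand in for KV states.

```python
from collections import deque
from typing import Deque, List

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average equal-length vectors element-wise (a stand-in for compression)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


class TieredCache:
    """Hypothetical bounded cache: each tier has its own retention policy."""

    def __init__(self, n_anchor: int, n_recent: int, n_history: int, pool: int):
        self.n_anchor = n_anchor      # anchors: first states, never evicted
        self.n_recent = n_recent      # recent window: strict FIFO
        self.pool = pool              # evicted states merged per history slot
        self.anchors: List[Vector] = []
        self.history: Deque[Vector] = deque(maxlen=n_history)  # oldest slot dropped
        self.recent: Deque[Vector] = deque()
        self._pending: List[Vector] = []  # evicted states awaiting compression

    def append(self, state: Vector) -> None:
        # Policy 1: the first n_anchor states are kept verbatim.
        if len(self.anchors) < self.n_anchor:
            self.anchors.append(state)
            return
        # Policy 2: the recent window keeps only the latest n_recent states.
        self.recent.append(state)
        if len(self.recent) > self.n_recent:
            self._pending.append(self.recent.popleft())
        # Policy 3: evicted states are pooled into one compressed history slot.
        if len(self._pending) == self.pool:
            self.history.append(mean_pool(self._pending))
            self._pending = []

    def context(self) -> List[Vector]:
        """States visible to attention: anchors, compressed history, recent."""
        return self.anchors + list(self.history) + list(self.recent)
```

Under these assumptions the visible context never exceeds `n_anchor + n_history + n_recent` entries, however many states are appended, which is the sense in which such a cache is bounded.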
