Anchor Forcing: Anchor Memory and Tri-Region RoPE for Interactive Streaming Video Diffusion
Yang Yang, Tianyi Zhang, Wei Huang, Jinwei Chen, Boxi Wu, Xiaofei He, Deng Cai, Bo Li, Peng-Tao Jiang

TL;DR
Anchor Forcing is a framework for interactive streaming video diffusion that uses anchor caches and tri-region RoPE to preserve perceptual quality and motion consistency across prompt switches.
Contribution
The paper proposes a cache-centric framework that combines anchor-guided re-cache with tri-region RoPE to address the progressive quality degradation and weakened motion dynamics seen in existing streaming video diffusion methods.
Findings
Improves perceptual quality over prior streaming methods.
Improves motion retention over long horizons.
Stabilizes quality during prompt switches.
Abstract
Interactive long video generation requires prompt switching to introduce new subjects or events, while maintaining perceptual fidelity and coherent motion over extended horizons. Recent distilled streaming video diffusion models reuse a rolling KV cache for long-range generation, enabling prompt-switch interaction through re-cache at each switch. However, existing streaming methods still exhibit progressive quality degradation and weakened motion dynamics. We identify two failure modes specific to interactive streaming generation: (i) at each prompt switch, current cache maintenance cannot simultaneously retain KV-based semantic context and recent latent cues, resulting in weak boundary conditioning and reduced perceptual quality; and (ii) during distillation, unbounded time indexing induces a positional distribution shift from the pretrained backbone's bounded RoPE regime, weakening…
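The rolling KV cache and the unbounded time indexing mentioned in the abstract can be illustrated with a toy sketch. This is not the paper's anchor-guided re-cache or tri-region RoPE, whose details are not given here; the class `RollingKVCache`, its methods, and the capacity of 4 are hypothetical names and values chosen only for illustration. The sketch shows how absolute frame indices keep growing as a fixed-size cache rolls forward, while window-relative indices stay within a bounded range.

```python
from collections import deque


class RollingKVCache:
    """Toy fixed-capacity cache (hypothetical, for illustration only).

    Appending beyond `capacity` silently evicts the oldest entry,
    which is how a rolling window behaves.
    """

    def __init__(self, capacity):
        self.entries = deque(maxlen=capacity)

    def append(self, frame_index, kv):
        # Store the absolute frame index alongside a placeholder KV payload.
        self.entries.append((frame_index, kv))

    def absolute_positions(self):
        # Unbounded indexing: positions grow with the length of the stream.
        return [idx for idx, _ in self.entries]

    def window_positions(self):
        # Bounded indexing: positions are relative to the current window.
        return list(range(len(self.entries)))


cache = RollingKVCache(capacity=4)
for t in range(10):
    cache.append(t, kv=f"kv_{t}")

print(cache.absolute_positions())  # [6, 7, 8, 9]
print(cache.window_positions())    # [0, 1, 2, 3]
```

After ten frames the absolute indices have moved well past the cache size, whereas the window-relative indices never exceed the capacity. A backbone pretrained on a bounded positional range would only ever have seen indices of the second kind, which is one way to picture the positional distribution shift the abstract describes.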
Taxonomy
Topics: Image and Video Quality Assessment · Generative Adversarial Networks and Image Synthesis · Video Coding and Compression Technologies
