Grounded Forcing: Bridging Time-Independent Semantics and Proximal Dynamics in Autoregressive Video Synthesis

Jintao Chen; Chengyu Bai; Junjun Hu; Xinda Xue; Mu Xu

arXiv:2604.06939·cs.CV·April 14, 2026

TL;DR

Grounded Forcing is a framework for long-horizon autoregressive video synthesis that targets three problems together: it keeps semantics stable over long sequences, keeps positional encodings consistent, and makes prompt transitions smooth.

Contribution

The paper presents three mechanisms (Dual Memory KV Cache, Dual-Reference RoPE Injection, and Asymmetric Proximity Recache) that together improve long-range coherence and controllability in video generation.

Findings

01 Enhanced long-term semantic coherence and identity stability.

02 Reduced visual drift and improved visual stability.

03 Facilitated smooth semantic inheritance during prompt transitions.

Abstract

Autoregressive video synthesis offers a promising pathway for infinite-horizon generation but is fundamentally hindered by three intertwined challenges: semantic forgetting from context limitations, visual drift due to positional extrapolation, and controllability loss during interactive instruction switching. Current methods often tackle these issues in isolation, limiting long-term coherence. We introduce Grounded Forcing, a novel framework that bridges time-independent semantics and proximal dynamics through three interlocking mechanisms. First, to address semantic forgetting, we propose a Dual Memory KV Cache that decouples local temporal dynamics from global semantic anchors, ensuring long-term semantic coherence and identity stability. Second, to suppress visual drift, we design Dual-Reference RoPE Injection, which confines positional embeddings within the training manifold while…
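
The abstract describes the Dual Memory KV Cache only at a high level: local temporal dynamics are kept apart from global semantic anchors. The sketch below illustrates one plausible reading of that idea and is not the authors' implementation. It assumes a small, fixed set of anchor entries that is never evicted, plus a sliding window of recent entries. The class name `DualMemoryCache`, the parameters `max_anchors` and `window`, and the `is_anchor` flag are all hypothetical, and strings stand in for real key/value tensors.

```python
from collections import deque


class DualMemoryCache:
    """Toy two-part cache: persistent anchors plus a sliding local window.

    Hypothetical illustration only; the paper's actual selection and
    eviction rules are not given in the abstract.
    """

    def __init__(self, max_anchors, window):
        self.max_anchors = max_anchors
        self.anchors = []                  # global entries, never evicted
        self.local = deque(maxlen=window)  # recent entries, oldest dropped first

    def append(self, frame_id, kv, is_anchor=False):
        # Assumption: the caller decides which frames serve as anchors.
        if is_anchor and len(self.anchors) < self.max_anchors:
            self.anchors.append((frame_id, kv))
        else:
            self.local.append((frame_id, kv))

    def context_ids(self):
        # Frames visible to attention: all anchors first, then the local window.
        return [fid for fid, _ in self.anchors] + [fid for fid, _ in self.local]


cache = DualMemoryCache(max_anchors=1, window=3)
cache.append(0, "kv0", is_anchor=True)   # first frame kept as a global anchor
for t in range(1, 10):
    cache.append(t, f"kv{t}")            # later frames only enter the window
print(cache.context_ids())
```

In this toy setup the anchor frame stays in context no matter how many frames follow, while the local window holds only the most recent three, so the attention context stays bounded as generation continues.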
