TL;DR
WorldKV is a training-free framework for persistent world modeling in autoregressive video diffusion. It combines retrieval of evicted KV-cache chunks with compression of redundant tokens, matching or exceeding full-KV memory fidelity at roughly twice the throughput.
Contribution
WorldKV pairs two training-free components, World Retrieval and World Compression, to maintain long-term consistency without the cost of full KV-cache attention.
Findings
Matches or exceeds full-KV memory fidelity
Doubles throughput compared to full-KV methods
Requires no fine-tuning, yet remains competitive with trained baselines
Abstract
Autoregressive video diffusion models have enabled real-time, action-conditioned world generation. However, sustaining a persistent world, where revisiting a previously seen viewpoint yields consistent content, remains an open problem. Full KV-cache attention preserves this consistency but breaks real-time constraints: memory footprint and attention cost grow linearly with rollout length. Sliding window inference restores throughput but discards long-term consistency. We propose WorldKV, a training-free framework with two components: World Retrieval and World Compression. World Retrieval stores evicted KV-cache chunks in GPU/CPU memory and selectively retrieves scene-relevant chunks via camera/action correspondence, inserting them back into the native attention window without re-encoding. World Compression prunes redundant tokens within each chunk via key-key similarity to an anchor…
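To make the two components concrete, here is a minimal pure-Python sketch of the general idea, not the paper's implementation. Everything named here is an assumption for illustration: the `Chunk` type, tagging each chunk with a 3-D camera position, ranking chunks by Euclidean distance to the current camera, using the first token's key as the anchor, and the `threshold` parameter. The abstract is truncated before it specifies how the anchor is chosen or how correspondence is scored, so those choices are placeholders.

```python
import math
from dataclasses import dataclass


@dataclass
class Chunk:
    """Hypothetical evicted KV-cache chunk, tagged with a camera position."""
    camera_pos: tuple  # assumed (x, y, z) tag recorded at generation time
    keys: list         # one key vector (list of floats) per token
    values: list       # one value vector per token


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(store, query_pos, top_k):
    """Return the top_k stored chunks nearest to the current camera position.

    Distance-based ranking is an assumption standing in for the paper's
    camera/action correspondence.
    """
    ranked = sorted(store, key=lambda c: math.dist(c.camera_pos, query_pos))
    return ranked[:top_k]


def compress(chunk, threshold):
    """Drop tokens whose key is nearly parallel to the anchor key.

    Assumption: the anchor is the chunk's first token, and a token counts as
    redundant when its key-key cosine similarity to the anchor is at least
    `threshold`. The anchor itself is always kept.
    """
    anchor = chunk.keys[0]
    keep = [0] + [
        i for i in range(1, len(chunk.keys))
        if cosine(chunk.keys[i], anchor) < threshold
    ]
    return Chunk(
        chunk.camera_pos,
        [chunk.keys[i] for i in keep],
        [chunk.values[i] for i in keep],
    )


# Two stored chunks; the camera returns close to where the first was generated.
store = [
    Chunk((0.0, 0.0, 0.0),
          [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]],
          [[1.0], [2.0], [3.0]]),
    Chunk((5.0, 0.0, 0.0),
          [[0.0, 1.0], [1.0, 0.0]],
          [[4.0], [5.0]]),
]
nearest = retrieve(store, (0.5, 0.0, 0.0), top_k=1)[0]
small = compress(nearest, threshold=0.95)
```

In this toy run the middle token of the retrieved chunk is pruned because its key is almost identical to the anchor's, while the orthogonal third token survives. The retrieved, compressed keys and values would then be placed back into the attention window as stored, which is what lets this style of approach avoid re-encoding.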
