WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling
Yuta Oshima, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

TL;DR
WorldPack introduces an efficient compressed memory for video world models that significantly improves spatial consistency and visual fidelity in long-term video generation, despite using a shorter context length.
Contribution
The paper presents WorldPack, a novel video world model that uses trajectory packing and memory retrieval to improve long-term spatial consistency with reduced computational costs.
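To make the cost argument concrete, here is a minimal sketch of why packing can keep the context short. It assumes, hypothetically, that a frame k steps in the past is compressed to `tokens_per_frame // ratio**k` tokens, in the spirit of importance-based frame compression. The function names, the 256-token frame size, and the ratio of 2 are illustrative assumptions, not values taken from the paper.

```python
def packed_context_tokens(num_frames, tokens_per_frame=256, ratio=2):
    """Total context tokens under a hypothetical geometric packing schedule.

    The frame k steps in the past keeps tokens_per_frame // ratio**k tokens;
    a frame that would keep 0 tokens is effectively dropped from the context.
    """
    total = 0
    for k in range(num_frames):
        total += tokens_per_frame // ratio**k
    return total


def uniform_context_tokens(num_frames, tokens_per_frame=256):
    """Total context tokens when every past frame is kept at full resolution."""
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    for n in (4, 16, 100):
        print(n, packed_context_tokens(n), uniform_context_tokens(n))
```

Under this assumed schedule the packed total is bounded by a geometric series (below twice the per-frame token count), while the uniform total grows linearly with the number of frames.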
Findings
Outperforms state-of-the-art models on the LoopNav benchmark
Achieves higher spatial consistency in long-term video generation
Maintains quality with shorter context lengths
Abstract
Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our…
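The memory retrieval step mentioned in the abstract can be illustrated with a small sketch. This is not the paper's scoring function: it assumes each past frame is stored with a hypothetical pose `(x, z, yaw)` and ranks frames by a simple product of positional proximity and heading alignment. The names `retrieval_score`, `retrieve`, and `dist_scale` are invented for illustration, and this simplified score does not model special cases such as opposite-facing views.

```python
import math


def retrieval_score(query, candidate, dist_scale=8.0):
    """Hypothetical relevance of a stored frame to the current pose.

    Poses are (x, z, yaw) with yaw in radians. The score is high when the
    stored frame was captured nearby and facing a similar direction.
    """
    qx, qz, qyaw = query
    cx, cz, cyaw = candidate
    dist = math.hypot(qx - cx, qz - cz)
    proximity = math.exp(-dist / dist_scale)            # 1 at zero distance
    alignment = 0.5 * (1.0 + math.cos(qyaw - cyaw))     # 1 when headings match
    return proximity * alignment


def retrieve(query, memory, k=2):
    """Return indices of the k stored poses with the highest score."""
    ranked = sorted(
        range(len(memory)),
        key=lambda i: retrieval_score(query, memory[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    memory = [(0.0, 0.0, 0.0), (50.0, 50.0, 0.0), (1.0, 0.0, math.pi), (0.5, 0.0, 0.1)]
    print(retrieve((0.0, 0.0, 0.0), memory, k=2))
```

The retrieved frames would then be placed in the context alongside the packed recent trajectory, so that revisited locations can be generated consistently without keeping the full history.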
Peer Reviews
Decision·Submitted to ICLR 2026
1. The proposed idea of combining trajectory packing with retrieval-based compression intuitively addresses the limitations of fixed-context world models.
2. Retrieval formulation does not rely on camera intrinsics and explicitly handles opposite-facing views at characteristic distances, improving practical applicability in game/sim settings.
3. Consistent gains on LoopNav across metrics and qualitative rollouts, with some ablations isolating the impact of retrieval vs packing and showing their
1. My main concern is the latency/memory cost of the introduced retrieval-based compression approach, because it introduces many additional operations. Although claimed as “computationally efficient,” the paper does not analyze actual memory savings or inference latencies of the compressed memory versus previous baselines.
2. Evaluation is confined to a single simulator benchmark (Minecraft/LoopNav), and it is more important to test beyond simulators and towards real-world data to validate gener
1. The proposed method is technically sound, and the experimental results clearly demonstrate its superiority over provided baseline approaches.
2. The paper is well written and easy to follow.
The technical novelty of the paper is somewhat limited. The central idea, importance-based frame compression, has already been explored in next-frame prediction and video diffusion models (FramePack). The paper’s main contribution appears to be in adapting this idea to the world modeling context through specific retrieval strategies and benchmark evaluations.
- I suggest that FramePack be clearly highlighted in the Preliminaries section to better situate this work within existing literature and
Introduces a compressed memory approach (trajectory packing + memory retrieval) that creatively addresses long-horizon consistency without increasing context length. Empirically outperforms baselines on LoopNav, improving spatial consistency, fidelity, and long-term generation quality.
**Limited innovation in memory design**: The proposed trajectory packing and memory retrieval are common ideas that have been widely used, including in RNNs and LSTMs via memory banks. The paper does not articulate a fundamentally new algorithmic principle or provide a theoretically grounded formulation that distinguishes its approach from prior methods. Moreover, the related work section lacks a thorough discussion of these connections.
**Incremental engineering on existing backbones**: Using
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Data Management and Algorithms
