WorldPack: Compressed Memory Improves Spatial Consistency in Video World Modeling

Yuta Oshima; Yusuke Iwasawa; Masahiro Suzuki; Yutaka Matsuo; Hiroki Furuta

arXiv:2512.02473·cs.CV·December 3, 2025

Open Access · 3 Reviews

TL;DR

WorldPack introduces an efficient compressed memory for video world models that significantly improves spatial consistency and visual fidelity in long-term video generation while using a much shorter context length.

Contribution

The paper presents WorldPack, a novel video world model that uses trajectory packing and memory retrieval to improve long-term spatial consistency with reduced computational costs.

Findings

01 Outperforms state-of-the-art models on the LoopNav benchmark

02 Achieves higher spatial consistency in long-term video generation

03 Maintains quality with shorter context lengths

Abstract

Video world models have attracted significant attention for their ability to produce high-fidelity future visual observations conditioned on past observations and navigation actions. Temporally- and spatially-consistent, long-term world modeling has been a long-standing problem, unresolved with even recent state-of-the-art models, due to the prohibitively expensive computational costs for long-context inputs. In this paper, we propose WorldPack, a video world model with efficient compressed memory, which significantly improves spatial consistency, fidelity, and quality in long-term generation despite much shorter context length. Our compressed memory consists of trajectory packing and memory retrieval; trajectory packing realizes high context efficiency, and memory retrieval maintains the consistency in rollouts and helps long-term generations that require spatial reasoning. Our…
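To make the context-efficiency claim concrete, here is a minimal sketch of how packing older frames more aggressively can bound context length. It is not the authors' implementation: the geometric schedule is an assumption modeled on the FramePack-style importance-based compression that Reviewer 02 names as related prior work, and the function name `packed_budgets`, the base of 256 tokens per frame, the halving ratio, and the one-token floor are all hypothetical values chosen for illustration. Memory retrieval is not sketched here.

```python
def packed_budgets(num_frames, tokens_per_frame=256, ratio=2, min_tokens=1):
    """Return per-frame token budgets, most recent frame first.

    Hypothetical schedule (an assumption, not taken from the paper):
    a frame that is `age` steps old keeps tokens_per_frame // ratio**age
    tokens, but never fewer than min_tokens.
    """
    return [max(tokens_per_frame // ratio ** age, min_tokens)
            for age in range(num_frames)]


# Compare the packed context with an uncompressed context of 8 frames.
budgets = packed_budgets(8)
packed_total = sum(budgets)
full_total = 8 * 256

print(budgets)                    # budget per frame, newest first
print(packed_total, full_total)   # packed vs. uncompressed token count
```

Under these assumed numbers, eight frames occupy 510 tokens instead of 2048, and each additional older frame adds only the one-token floor. This illustrates how a long trajectory could fit into a much shorter context, at the cost of coarser detail for older frames.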

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 5

Strengths

1. The proposed idea of combining trajectory packing with retrieval-based compression intuitively addresses the limitations of fixed-context world models.
2. Retrieval formulation does not rely on camera intrinsics and explicitly handles opposite-facing views at characteristic distances, improving practical applicability in game/sim settings.
3. Consistent gains on LoopNav across metrics and qualitative rollouts, with some ablations isolating the impact of retrieval vs packing and showing their

Weaknesses

1. My main concern is the latency/memory cost of the introduced retrieval-based compression approach, because it introduces many additional operations. Although claimed as “computationally efficient,” the paper does not analyze actual memory savings or inference latencies of the compressed memory versus previous baselines.
2. Evaluation is confined to a single simulator benchmark (Minecraft/LoopNav), and it is more important to test beyond simulators and towards real-world data to validate gener

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The proposed method is technically sound, and the experimental results clearly demonstrate its superiority over provided baseline approaches. 2. The paper is well written and easy to follow.

Weaknesses

The technical novelty of the paper is somewhat limited. The central idea, importance-based frame compression, has already been explored in next-frame prediction and video diffusion models (FramePack). The paper’s main contribution appears to be in adapting this idea to the world modeling context through specific retrieval strategies and benchmark evaluations.
- I suggest that FramePack be clearly highlighted in the Preliminaries section to better situate this work within existing literature and

Reviewer 03 · Rating 2 · Confidence 4

Strengths

Introduces a compressed memory approach (trajectory packing + memory retrieval) that creatively addresses long-horizon consistency without increasing context length. Empirically outperforms baselines on LoopNav, improving spatial consistency, fidelity, and long-term generation quality.

Weaknesses

**Limited innovation in memory design**: The proposed trajectory packing and memory retrieval are common ideas that have been widely used, including in RNNs and LSTMs via memory banks. The paper does not articulate a fundamentally new algorithmic principle or provide a theoretically grounded formulation that distinguishes its approach from prior methods. Moreover, the related work section lacks a thorough discussion of these connections.

**Incremental engineering on existing backbones**: Using

Taxonomy

Topics: Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization · Data Management and Algorithms