Long-Context State-Space Video World Models

Ryan Po; Yotam Nitzan; Richard Zhang; Berlin Chen; Tri Dao; Eli Shechtman; Gordon Wetzstein; Xun Huang

arXiv:2505.20171·cs.CV·May 27, 2025

Open Access

TL;DR

This paper introduces a long-context video world model that uses state-space models to extend long-term memory in video prediction, improving performance on extended-horizon tasks while maintaining efficiency.

Contribution

The authors propose a new architecture that leverages state-space models with a block-wise scanning scheme and dense local attention for long-term video modeling.
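One way to picture a block-wise scanning scheme is as a reordering of the video tokens before they are fed to a causal SSM. The sketch below is illustrative only and is not the authors' implementation: the function names, the `(t, y, x)` token layout, and the choice to scan each spatial block through all frames before moving to the next block are assumptions made for this example. It compares how far apart two tokens end up in the scan sequence under a plain frame-by-frame order and under a block-wise order.

```python
from typing import Dict, List, Tuple

Token = Tuple[int, int, int]  # (frame index t, row y, column x); hypothetical layout


def frame_major_order(T: int, H: int, W: int) -> List[Token]:
    """Scan every token of frame 0, then every token of frame 1, and so on."""
    return [(t, y, x) for t in range(T) for y in range(H) for x in range(W)]


def blockwise_order(T: int, H: int, W: int, bh: int, bw: int) -> List[Token]:
    """Assumed block-wise scan: each bh x bw spatial block is scanned through
    all T frames before the scan moves on to the next spatial block."""
    if H % bh or W % bw:
        raise ValueError("block size must divide the frame size")
    order: List[Token] = []
    for by in range(0, H, bh):
        for bx in range(0, W, bw):
            for t in range(T):
                for y in range(by, by + bh):
                    for x in range(bx, bx + bw):
                        order.append((t, y, x))
    return order


def scan_gap(order: List[Token], a: Token, b: Token) -> int:
    """Number of scan steps from token a to token b in the given order."""
    pos: Dict[Token, int] = {tok: i for i, tok in enumerate(order)}
    return pos[b] - pos[a]


if __name__ == "__main__":
    T, H, W = 4, 4, 4
    full = frame_major_order(T, H, W)
    blocks = blockwise_order(T, H, W, bh=2, bw=2)
    # Same pixel, one frame later: the temporal distance in the scan.
    print(scan_gap(full, (0, 0, 0), (1, 0, 0)), scan_gap(blocks, (0, 0, 0), (1, 0, 0)))
    # Horizontal neighbours that fall in different blocks: the spatial distance.
    print(scan_gap(full, (0, 0, 1), (0, 0, 2)), scan_gap(blocks, (0, 0, 1), (0, 0, 2)))
```

In this toy setting, the same pixel in consecutive frames is 16 scan steps apart in frame-major order but only 4 steps apart in the block-wise order, while two horizontally adjacent tokens on either side of a block boundary move from 1 step apart to 15. That is the trade-off in miniature: shorter temporal distances for the recurrent state to bridge, at the cost of spatial neighbours that are no longer close in the scan.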

Findings

01 Outperforms baselines in long-range memory tasks

02 Maintains practical inference speeds

03 Effective in spatial retrieval and reasoning over extended horizons

Abstract

Video diffusion models have recently shown promise for world modeling through autoregressive frame prediction conditioned on actions. However, they struggle to maintain long-term memory due to the high computational cost associated with processing extended sequences in attention layers. To overcome this limitation, we propose a novel architecture leveraging state-space models (SSMs) to extend temporal memory without compromising computational efficiency. Unlike previous approaches that retrofit SSMs for non-causal vision tasks, our method fully exploits the inherent advantages of SSMs in causal sequence modeling. Central to our design is a block-wise SSM scanning scheme, which strategically trades off spatial consistency for extended temporal memory, combined with dense local attention to ensure coherence between consecutive frames. We evaluate the long-term memory capabilities of our…
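The dense local attention mentioned in the abstract can be pictured as an attention mask over frames. The sketch below is a minimal illustration under stated assumptions, not the paper's definition: the function name, the `window` parameter, and the rule that a token attends densely to its own frame and to the preceding `window - 1` frames are all hypothetical choices made for this example.

```python
from typing import List


def frame_local_mask(num_frames: int, tokens_per_frame: int, window: int) -> List[List[bool]]:
    """Boolean mask where mask[q][k] is True if query token q may attend to key token k.

    Assumption for this sketch: attention is dense within a frame and causal
    across frames, reaching back at most `window - 1` frames.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    n = num_frames * tokens_per_frame
    mask: List[List[bool]] = []
    for q in range(n):
        q_frame = q // tokens_per_frame
        row = []
        for k in range(n):
            k_frame = k // tokens_per_frame
            # Allowed when the key frame is the query frame or one of the
            # previous window - 1 frames; future frames are never visible.
            row.append(0 <= q_frame - k_frame < window)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for row in frame_local_mask(num_frames=3, tokens_per_frame=2, window=2):
        print("".join("x" if allowed else "." for allowed in row))
```

Under these assumptions each query sees at most `window * tokens_per_frame` keys regardless of video length, so the attention cost per token stays fixed and anything older than the window has to be carried by the SSM state.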

Taxonomy

Topics: Simulation Techniques and Applications

Methods: Softmax · Attention Is All You Need · Diffusion