StateSpaceDiffuser: Bringing Long Context to Diffusion World Models

Nedko Savov; Naser Kazemi; Deheng Zhang; Danda Pani Paudel; Xi Wang; Luc Van Gool

arXiv:2505.22246 · cs.CV · November 3, 2025

Open Access

TL;DR

StateSpaceDiffuser augments a diffusion-based world model with features from a state-space model that represents the entire interaction history, restoring long-term memory and improving temporal coherence in visual predictions over extended sequences.

Contribution

The paper introduces StateSpaceDiffuser, a method that combines a diffusion model with a state-space model so that a world model retains long-term context instead of relying on only a few recent observations.

Findings

01 Outperforms diffusion-only baselines in long-term coherence

02 Maintains visual consistency over extended sequences

03 Effective in both 2D maze and 3D environments

Abstract

World models have recently gained prominence for action-conditioned visual prediction in complex environments. However, relying on only a few recent observations causes them to lose long-term context. Consequently, within a few steps, the generated scenes drift from what was previously observed, undermining temporal coherence. This limitation, common in state-of-the-art world models, which are diffusion-based, stems from the lack of a lasting environment state. To address this problem, we introduce StateSpaceDiffuser, where a diffusion model is enabled to perform long-context tasks by integrating features from a state-space model, representing the entire interaction history. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. To rigorously measure temporal consistency, we develop an evaluation protocol that probes a model's ability to…
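The core idea in the abstract, conditioning generation on a state that summarizes the whole interaction history rather than only a few recent observations, can be illustrated with a toy sketch. This is not the authors' architecture: the diagonal linear recurrence, the function names (`ssm_step`, `summarize_history`, `build_condition`), the `decay` and `gain` constants, and the use of short feature vectors in place of frames are all assumptions made for illustration, and action conditioning is omitted.

```python
from typing import List, Sequence

Vector = List[float]


def ssm_step(state: Vector, x: Sequence[float],
             decay: float = 0.9, gain: float = 0.1) -> Vector:
    """One step of a toy diagonal linear recurrence: h_t = decay * h_{t-1} + gain * x_t."""
    return [decay * h + gain * xi for h, xi in zip(state, x)]


def summarize_history(frames: Sequence[Sequence[float]], dim: int,
                      decay: float = 0.9, gain: float = 0.1) -> Vector:
    """Fold every frame feature seen so far into one fixed-size state."""
    state = [0.0] * dim
    for x in frames:
        state = ssm_step(state, x, decay, gain)
    return state


def build_condition(frames: Sequence[Sequence[float]], window: int, dim: int) -> Vector:
    """Hypothetical conditioning vector: recent-frame features plus the long-context state."""
    recent = [v for frame in frames[-window:] for v in frame]
    memory = summarize_history(frames, dim)
    return recent + memory


# Two histories that differ only in the first frame, outside the recent window.
hist_a = [[1.0, 0.0]] + [[0.0, 0.0]] * 4
hist_b = [[0.0, 1.0]] + [[0.0, 0.0]] * 4

window_only_a = hist_a[-2:]
window_only_b = hist_b[-2:]
cond_a = build_condition(hist_a, window=2, dim=2)
cond_b = build_condition(hist_b, window=2, dim=2)

print(window_only_a == window_only_b)  # the short window cannot tell the histories apart
print(cond_a != cond_b)                # the state-augmented condition can
```

In this toy setting, a model that sees only the last two frames receives identical inputs for both histories, while the recurrent state still carries a trace of the first frame. A real system would feed such a state into the diffusion model's conditioning pathway; how StateSpaceDiffuser does this is described in the paper, not here.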

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Pose and Action Recognition

Methods: Diffusion