Grounding World Simulation Models in a Real-World Metropolis

Junyoung Seo; Hyunwook Choi; Minkyung Kwon; Jinhyeok Choi; Siyoon Jin; Gayoung Lee; Junho Kim; JoungBin Lee; Geonmo Gu; Dongyoon Han; Sangdoo Yun; Seungryong Kim; Jin-Hwa Kim

arXiv:2603.15583·cs.CV·March 17, 2026

TL;DR

This paper introduces Seoul World Model (SWM), a city-scale video generation model grounded in real-world Seoul data, capable of producing long, diverse, and spatially faithful urban videos with temporal consistency.

Contribution

The paper presents SWM, a novel city-scale world model that integrates retrieval-augmented conditioning, synthetic data generation, and a Virtual Lookahead Sink for stable, long-horizon urban video synthesis.

Findings

01. SWM outperforms existing models in urban video fidelity.
02. SWM supports diverse camera trajectories and text prompts.
03. SWM generates long-horizon, temporally consistent city videos.

Abstract

What if a world simulation model could render not an imagined environment but a city that actually exists? Prior generative world models synthesize visually plausible yet artificial environments by imagining all content. We present Seoul World Model (SWM), a city-scale world model grounded in the real city of Seoul. SWM anchors autoregressive video generation through retrieval-augmented conditioning on nearby street-view images. However, this design introduces several challenges, including temporal misalignment between retrieved references and the dynamic target scene, limited trajectory diversity, and data sparsity from vehicle-mounted captures at sparse intervals. We address these challenges through cross-temporal pairing, a large-scale synthetic dataset enabling diverse camera trajectories, and a view interpolation pipeline that synthesizes coherent training videos from sparse…
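
As a rough illustration of what "retrieval on nearby street-view images" could look like, the sketch below picks the k geographically closest reference images for a target camera position. It is not the authors' pipeline: the `StreetViewRef` record, the `retrieve_nearby` function, the great-circle distance metric, and the `k` and `max_dist_m` parameters are all assumptions made for this example, and the abstract does not say how SWM actually indexes or scores its references.

```python
import math
from typing import List, NamedTuple

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


class StreetViewRef(NamedTuple):
    """Hypothetical record for one street-view capture."""
    image_id: str
    lat: float  # degrees
    lon: float  # degrees


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def retrieve_nearby(
    refs: List[StreetViewRef],
    lat: float,
    lon: float,
    k: int = 2,
    max_dist_m: float = 50.0,
) -> List[StreetViewRef]:
    """Return up to k references within max_dist_m, nearest first.

    Both k and max_dist_m are illustrative defaults, not values from the paper.
    """
    scored = [(haversine_m(lat, lon, r.lat, r.lon), r) for r in refs]
    nearby = [(d, r) for d, r in scored if d <= max_dist_m]
    nearby.sort(key=lambda pair: pair[0])
    return [r for _, r in nearby[:k]]


# Toy database of three captures along one street in central Seoul.
refs = [
    StreetViewRef("far", 37.5765, 126.9780),
    StreetViewRef("b", 37.5668, 126.9780),
    StreetViewRef("a", 37.5666, 126.9780),
]
selected = retrieve_nearby(refs, lat=37.5665, lon=126.9780)
selected_ids = [r.image_id for r in selected]
```

In this toy setup the two captures within the 50 m radius are returned nearest first, and the distant one is dropped. A real system would presumably also have to deal with the temporal mismatch between references and the target scene that the abstract mentions, which this sketch ignores.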

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Human Motion and Animation · Advanced Vision and Imaging