World2VLM: Distilling World Model Imagination into VLMs for Dynamic Spatial Reasoning

Wanyue Zhang; Wenxiang Wu; Wang Xu; Jiaxin Luo; Helu Zhi; Yibin Huang; Shuo Ren; Zitao Liu; Jiajun Zhang

arXiv:2604.26934 · cs.CV · April 30, 2026

TL;DR

This paper introduces World2VLM, a training framework that distills spatial imagination from a world model into vision-language models, enhancing dynamic spatial reasoning without costly inference-time computations.

Contribution

The paper presents a training approach that lets VLMs internalize spatial imagination through distillation from a generative world model.

Findings

01 World2VLM improves performance on multiple spatial reasoning benchmarks.

02 It outperforms inference-time world model coupling methods.

03 The approach eliminates the need for expensive inference-time generation.

Abstract

Vision-language models (VLMs) have shown strong performance on static visual understanding, yet they still struggle with dynamic spatial reasoning that requires imagining how scenes evolve under egocentric motion. Recent efforts address this limitation either by scaling spatial supervision with synthetic data or by coupling VLMs with world models at inference time. However, the former often lacks explicit modeling of motion-conditioned state transitions, while the latter incurs substantial computational overhead. In this work, we propose World2VLM, a training framework that distills spatial imagination from a generative world model into a vision-language model. Given an initial observation and a parameterized camera trajectory, we use a view-consistent world model to synthesize geometrically aligned future views and derive structured supervision for both forward (action-to-outcome) and…
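
To make the forward (action-to-outcome) supervision mentioned in the abstract more concrete, here is a minimal sketch of how an observation and a parameterized camera trajectory might be paired into one training record. It is an illustration under stated assumptions, not the authors' pipeline: `CameraStep`, `integrate_trajectory`, `make_forward_sample`, the planar move-then-turn parameterization, and the record fields are all hypothetical, and the outcome text is supplied by the caller rather than derived from a world model.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CameraStep:
    """Hypothetical egocentric motion primitive (an assumption, not from the paper)."""
    forward_m: float  # translation along the current heading, in meters
    yaw_deg: float    # rotation applied after the translation, in degrees


def integrate_trajectory(steps):
    """Return (x, y, heading_deg) after applying the steps from the origin.

    Assumes planar motion: each step translates along the current heading,
    then rotates by yaw_deg.
    """
    x, y, heading = 0.0, 0.0, 0.0
    for step in steps:
        rad = math.radians(heading)
        x += step.forward_m * math.cos(rad)
        y += step.forward_m * math.sin(rad)
        heading = (heading + step.yaw_deg) % 360.0
    return x, y, heading


def make_forward_sample(image_id, steps, outcome_text):
    """Build one hypothetical forward (action-to-outcome) record.

    outcome_text stands in for whatever description would be derived from
    the synthesized future view; here the caller simply provides it.
    """
    action = "; ".join(
        f"move {s.forward_m:g} m, turn {s.yaw_deg:g} deg" for s in steps
    )
    return {
        "image": image_id,
        "question": (
            f"Starting from this view, the camera performs: {action}. "
            "What is visible afterwards?"
        ),
        "answer": outcome_text,
        "final_pose": integrate_trajectory(steps),
    }
```

For example, `make_forward_sample("img_0001", [CameraStep(1.0, 90.0), CameraStep(2.0, 0.0)], "placeholder")` yields a record whose question spells out the two motion steps and whose `final_pose` is the integrated camera pose. Keeping the pose alongside the text is a design choice of this sketch, so that a record can be checked against the trajectory that produced it.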

Code & Models

Datasets

WanyueZhang/World2VLM · dataset · 189 downloads