GeoWorld-VLM: Geometry from World Models for Vision-Language Models

Renjie Gu; Kaichen Zhou; Yan Luo; Mengyu Wang

arXiv:2605.16713 · cs.CV · May 19, 2026

TL;DR

GeoWorld-VLM distills 3D geometric cues from world models into a vision-language model's visual pathway, improving spatial reasoning without altering language capabilities.

Contribution

Introduces a novel distillation framework that transfers geometric information from world models into VLMs, improving spatial reasoning across different architectures and datasets.

Findings

01. Improves spatial reasoning performance by ~4% on benchmarks.

02. Enhances the visual pathway with geometric cues without affecting language understanding.

03. Demonstrates generality across multiple VLM architectures.

Abstract

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as "left of", "on", "behind", and "between". One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static…
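
The training setup described in the abstract (only the image encoder and projector are updated, and post-projector features are pulled toward teacher features) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the alignment objective, so the cosine-based loss here is an assumption, as are the module names in `TRAINABLE`, the function names, and the premise that student and teacher features already share the same token count and dimensionality (in practice a projection head would likely be needed).

```python
import math

# Hypothetical module names; True marks the parts the abstract says are fine-tuned.
TRAINABLE = {
    "image_encoder": True,         # fine-tuned
    "projector": True,             # fine-tuned
    "main_backbone": False,        # left frozen
    "world_model_teacher": False,  # frozen teacher
}


def trainable_modules(flags):
    """Return the sorted names of modules that would receive gradient updates."""
    return sorted(name for name, on in flags.items() if on)


def cosine_alignment_loss(student_tokens, teacher_tokens, eps=1e-8):
    """Mean (1 - cosine similarity) over paired token feature vectors.

    student_tokens: post-projector image features, one vector per token.
    teacher_tokens: intermediate world-model features, same shape (assumed).
    The cosine form is an assumed stand-in for the paper's actual objective.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("need the same non-zero number of student and teacher tokens")
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        if len(s) != len(t):
            raise ValueError("student and teacher feature dimensions must match")
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        total += 1.0 - dot / (norm_s * norm_t + eps)
    return total / len(student_tokens)


if __name__ == "__main__":
    student = [[1.0, 0.0], [0.0, 1.0]]
    teacher = [[2.0, 0.0], [1.0, 0.0]]
    print(trainable_modules(TRAINABLE))
    print(cosine_alignment_loss(student, teacher))
```

In this toy case the first token pair points in the same direction and contributes almost nothing to the loss, while the second pair is orthogonal and contributes 1, so the loss is driven by tokens whose features disagree with the teacher. In a real framework the frozen flags would correspond to disabling gradients on the backbone and teacher parameters.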
