GeoWorld-VLM: Geometry from World Models for Vision-Language Models
Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

TL;DR
GeoWorld-VLM enhances vision-language models' spatial reasoning by distilling 3D geometric cues from world models into the visual pathway, improving spatial understanding without altering language capabilities.
Contribution
Introduces a novel distillation framework that transfers geometric information from world models into VLMs, improving spatial reasoning across different architectures and datasets.
Findings
Improves performance on spatial reasoning benchmarks by roughly 4%.
Injects geometric cues into the visual pathway while the main language backbone stays frozen, so language capabilities are left unaltered.
Demonstrates generality across multiple VLM architectures.
Abstract
Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static…
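The alignment step described in the abstract can be illustrated with a small sketch. The abstract states only that post-projector image features are aligned with intermediate world-model representations and that the image encoder and projector are the only fine-tuned parts. The cosine-based loss, the function names, and the parameter-group labels below are assumptions made for illustration, not the authors' implementation. The sketch also assumes student and teacher features already share a token count and dimension; in practice a small mapping head would likely be needed.

```python
import math

def cosine_alignment_loss(student_tokens, teacher_tokens):
    """Mean (1 - cosine similarity) over paired token features.

    Assumption: both arguments are non-empty, equal-length lists of
    equal-dimension vectors. The actual loss used by the paper is not
    given in the abstract.
    """
    if not student_tokens or len(student_tokens) != len(teacher_tokens):
        raise ValueError("need the same non-zero number of student and teacher tokens")
    total = 0.0
    for s, t in zip(student_tokens, teacher_tokens):
        dot = sum(a * b for a, b in zip(s, t))
        norm_s = math.sqrt(sum(a * a for a in s))
        norm_t = math.sqrt(sum(b * b for b in t))
        # Small epsilon guards against division by zero for all-zero vectors.
        total += 1.0 - dot / (norm_s * norm_t + 1e-8)
    return total / len(student_tokens)

# Hypothetical parameter-group labels. Per the abstract, only the image
# encoder and multimodal projector are fine-tuned; the main backbone and
# the world-model teacher stay frozen.
TRAINABLE = {
    "image_encoder": True,
    "projector": True,
    "language_backbone": False,
    "world_model_teacher": False,
}

def trainable_groups(flags=TRAINABLE):
    """Return the names of parameter groups that receive gradient updates."""
    return sorted(name for name, on in flags.items() if on)
```

A loss of this kind is near zero when student and teacher features point in the same direction and grows as they diverge, so minimizing it pulls the projected image features toward the teacher's geometry-aware representation without touching the language backbone.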
