TL;DR
This paper shows that vision-language models (VLMs) contain a latent topological map of 3D scenes, and that shaping and regularizing this map improves performance on spatial reasoning tasks.
Contribution
It characterizes the geometric structure of VLMs' latent spatial representations and introduces a regularization method that improves spatial understanding.
Findings
Latent topological maps in VLMs are overshadowed by visual semantics.
Cross-scene linear feature extraction isolates a clean spatial subspace that causally controls the model's spatial outputs.
Regularization improves spatial task performance by up to 12.1%.
Abstract
Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous 3D internal representation. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. Using cross-scene linear feature extraction, we isolate a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the…
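For readers unfamiliar with the construction the abstract refers to, the following is a minimal sketch of Laplacian eigenmaps computed on a Gaussian-kernel graph over 3D points. It is a textbook illustration, not the paper's implementation: the function name `laplacian_eigenmap`, the bandwidth `sigma`, the use of the unnormalized Laplacian, and the toy scene of six points are all assumptions made for the example.

```python
import numpy as np

def laplacian_eigenmap(points, sigma=1.0, dim=2):
    """Embed points using the lowest non-trivial eigenvectors of a graph Laplacian.

    points: (n, 3) array of 3D positions (hypothetical object locations).
    sigma:  Gaussian-kernel bandwidth (an assumed hyperparameter).
    dim:    number of embedding dimensions to return.
    """
    # Pairwise squared Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Gaussian-kernel edge weights; remove self-loops.
    W = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)

    # Unnormalized graph Laplacian L = D - W.
    L = np.diag(W.sum(axis=1)) - W

    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)

    # Skip the first eigenvector (constant, eigenvalue ~0).
    return eigvals[1:dim + 1], eigvecs[:, 1:dim + 1]

# Toy scene (assumed for illustration): six objects evenly spaced on a line.
points = np.array([[float(i), 0.0, 0.0] for i in range(6)])
eigvals, embedding = laplacian_eigenmap(points, sigma=1.0, dim=1)
```

In this toy case the one-dimensional embedding orders the six points in the same sequence as their positions along the line (up to a global sign flip), which is the sense in which such an embedding preserves topology rather than exact metric distances.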
