Uncovering and Shaping the Latent Representation of 3D Scene Topology in Vision-Language Models

Haoming Wang; Wei Gao

arXiv:2605.07148·cs.CV·May 11, 2026

TL;DR

This paper shows that vision-language models contain a latent topological map of 3D scenes, and that this map can be mathematically shaped and regularized to improve performance on spatial reasoning tasks.

Contribution

It uncovers the geometric nature of VLMs' latent representations and introduces a regularization method that enhances spatial understanding.

Findings

01 Latent topological maps in VLMs are overshadowed by visual semantics.

02 A linear feature extraction isolates a pure spatial subspace.

03 Regularization improves spatial task performance by up to 12.1%.

Abstract

Decades of cognitive science establish that humans navigate environments by forming cognitive maps, defined as allocentric and topology-preserving representations of 3D space. While modern Vision-Language Models (VLMs) demonstrate emergent spatial reasoning from 2D egocentric inputs, it remains unclear whether they construct an analogous internal representation of 3D space. In this paper, we demonstrate that current VLMs do possess a latent topological map of 3D scenes, but it is heavily overshadowed by non-geometric visual semantics, such as color and shape. Through cross-scene linear feature extraction, we isolate a clean spatial subspace that causally controls the model's spatial outputs. We mathematically shape this latent representation and prove its correspondence to the Laplacian eigenmaps of the scene's 3D Gaussian-kernel graph, converging to the…
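
The abstract relates the shaped representation to the Laplacian eigenmaps of a Gaussian-kernel graph built over the scene's 3D positions. For readers unfamiliar with that construction, here is a minimal sketch of standard Laplacian eigenmaps in NumPy. It is not the authors' implementation: the function name `laplacian_eigenmaps`, the bandwidth `sigma`, the use of a fully connected graph over object positions, and the choice of the symmetric normalized Laplacian are all assumptions made for illustration.

```python
import numpy as np


def laplacian_eigenmaps(points, sigma=1.0, dim=2):
    """Embed 3D points with Laplacian eigenmaps on a Gaussian-kernel graph.

    Hypothetical helper for illustration; `sigma` and `dim` are assumed
    parameters, not values taken from the paper.
    """
    points = np.asarray(points, dtype=float)

    # Pairwise squared Euclidean distances between all points.
    diff = points[:, None, :] - points[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)

    # Gaussian-kernel affinities; drop self-loops.
    weights = np.exp(-sq_dist / (2.0 * sigma ** 2))
    np.fill_diagonal(weights, 0.0)

    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}.
    degree = weights.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    lap = np.eye(len(points)) - d_inv_sqrt[:, None] * weights * d_inv_sqrt[None, :]

    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(lap)

    # Skip the trivial eigenvector (eigenvalue ~0) and rescale by D^{-1/2}
    # to obtain solutions of the generalized problem L y = lambda D y.
    embedding = d_inv_sqrt[:, None] * eigvecs[:, 1:dim + 1]
    return eigvals[:dim + 1], embedding


# Usage: embed 20 random object positions from a unit-cube "scene".
rng = np.random.default_rng(0)
scene_points = rng.uniform(size=(20, 3))
spectrum, coords = laplacian_eigenmaps(scene_points, sigma=0.5, dim=3)
```

Nearby points receive large kernel weights, so the low-frequency eigenvectors vary slowly between neighbors. This is the sense in which such an embedding preserves scene topology rather than exact metric distances.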

Code & Models

Repositories

pittisl/vlm-latent-shaping (GitHub)
