VLM-Loc: Localization in Point Cloud Maps via Vision-Language Models

Shuhao Kang; Youqi Liao; Peijie Wang; Wenlong Liao; Qilin Zhang; Benjamin Busam; Xieyuanli Chen; Yun Liu

arXiv:2603.09826 · cs.CV · March 11, 2026

TL;DR

VLM-Loc introduces a novel framework that uses vision-language models with spatial reasoning to improve text-based localization in 3D point cloud maps, outperforming existing methods.

Contribution

The paper presents VLM-Loc, a new approach that leverages large vision-language models with structured spatial representations for accurate point cloud localization from natural language.

Findings

01. VLM-Loc achieves higher accuracy than previous methods.

02. The approach demonstrates robustness across diverse environments.

03. The CityLoc benchmark enables systematic evaluation of text-to-point-cloud (T2P) localization.

Abstract

Text-to-point-cloud (T2P) localization aims to infer precise spatial positions within 3D point cloud maps from natural language descriptions, reflecting how humans perceive and communicate spatial layouts through language. However, existing methods largely rely on shallow text-point cloud correspondence without effective spatial reasoning, limiting their accuracy in complex environments. To address this limitation, we propose VLM-Loc, a framework that leverages the spatial reasoning capability of large vision-language models (VLMs) for T2P localization. Specifically, we transform point clouds into bird's-eye-view (BEV) images and scene graphs that jointly encode geometric and semantic context, providing structured inputs for the VLM to learn cross-modal representations bridging linguistic and spatial semantics. On top of these representations, we introduce a partial node assignment…
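
The conversion of a point cloud into a bird's-eye-view image and a scene graph, as described in the abstract, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `points_to_bev` and `build_scene_graph`, the integer class ids, the cell size, the rule that the highest point decides a cell's label, and the distance threshold for graph edges are all hypothetical choices made for the example.

```python
import math


def points_to_bev(points, labels, cell_size=1.0):
    """Rasterize labeled 3D points into a top-down grid of class ids.

    points: list of (x, y, z) tuples.
    labels: list of integer class ids, one per point (hypothetical scheme).
    Returns (bev, origin): bev is a list of rows, -1 marks an empty cell,
    and each occupied cell stores the label of its highest point.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, y0 = min(xs), min(ys)
    cols = [int(math.floor((x - x0) / cell_size)) for x in xs]
    rows = [int(math.floor((y - y0) / cell_size)) for y in ys]
    bev = [[-1] * (max(cols) + 1) for _ in range(max(rows) + 1)]
    # Visit points from lowest to highest so the top-most point wins per cell.
    for i in sorted(range(len(points)), key=lambda i: points[i][2]):
        bev[rows[i]][cols[i]] = labels[i]
    return bev, (x0, y0)


def build_scene_graph(instances, max_edge_dist=10.0):
    """Build a toy scene graph from object instances.

    instances: dict mapping instance id -> (class name, list of (x, y, z)).
    Each node keeps the class name and the 2D centroid; an edge links two
    nodes whose centroids lie within max_edge_dist (an assumed rule).
    """
    nodes = {}
    for key, (name, pts) in instances.items():
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        nodes[key] = {"label": name, "centroid": (cx, cy)}
    keys = sorted(nodes)
    edges = []
    for pos, a in enumerate(keys):
        for b in keys[pos + 1:]:
            dist = math.dist(nodes[a]["centroid"], nodes[b]["centroid"])
            if dist <= max_edge_dist:
                edges.append((a, b, dist))
    return nodes, edges
```

In this sketch the grid carries the geometric layout and the graph carries object-level semantics and adjacency, which mirrors the "jointly encode geometric and semantic context" idea only loosely; how the paper actually renders the BEV image or defines graph edges is not stated in the truncated abstract.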

Code & Models

Datasets

kang233/VLM-Loc
dataset · 37 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization