CityLoc: 6DoF Pose Distributional Localization for Text Descriptions in Large-Scale Scenes with Gaussian Representation

Qi Ma; Runyi Yang; Bin Ren; Nicu Sebe; Ender Konukoglu; Luc Van Gool; Danda Pani Paudel

arXiv:2501.08982 · cs.CV · February 4, 2025

TL;DR

CityLoc introduces a novel diffusion-based approach that generates pose distributions conditioned on textual descriptions, enabling robust localization in large-scale scenes by leveraging vision-language models and 3D Gaussian rendering.

Contribution

The paper presents a new method combining diffusion models, CLIP, and Gaussian splatting to improve text-based localization accuracy in large-scale 3D environments.

Findings

01. Outperforms standard distribution estimation methods across five datasets
02. Achieves higher localization accuracy with Gaussian rendering
03. Effectively handles ambiguous and broad textual descriptions

Abstract

Localizing textual descriptions within large-scale 3D scenes presents inherent ambiguities, such as identifying all traffic lights in a city. Addressing this, we introduce a method to generate distributions of camera poses conditioned on textual descriptions, facilitating robust reasoning for broadly defined concepts. Our approach employs a diffusion-based architecture to refine noisy 6DoF camera poses towards plausible locations, with conditional signals derived from pre-trained text encoders. Integration with the pretrained Vision-Language Model, CLIP, establishes a strong linkage between text descriptions and pose distributions. Enhancement of localization accuracy is achieved by rendering candidate poses using 3D Gaussian splatting, which corrects misaligned samples through visual reasoning. We validate our method's superiority by comparing it against standard distribution…
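The diffusion step described in the abstract, refining noisy 6DoF poses toward plausible locations under a text condition, can be illustrated with a generic DDPM-style reverse sampling loop. This is only a minimal sketch under stated assumptions, not the paper's implementation: the 7-dimensional pose layout (translation plus quaternion), the linear noise schedule, and the names `make_schedule`, `toy_denoiser`, and `sample_poses` are all hypothetical. In particular, `toy_denoiser` is an oracle stand-in for a learned, text-conditioned noise predictor; it simply treats the first seven entries of the embedding as the target pose. CLIP encoding and the Gaussian-splatting refinement are not modeled.

```python
import numpy as np

def make_schedule(num_steps=50, beta_start=1e-4, beta_end=0.02):
    """Linear DDPM noise schedule (an assumption, not taken from the paper)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t, text_emb, alpha_bars):
    """Hypothetical stand-in for a learned noise predictor.

    Pretends the first 7 entries of the text embedding are the clean pose x0
    and returns the noise that would explain x_t under the forward process.
    """
    x0 = text_emb[:7]
    return (x_t - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1.0 - alpha_bars[t])

def sample_poses(text_emb, num_samples=8, num_steps=50, seed=0):
    """Draw pose samples [tx, ty, tz, qw, qx, qy, qz] by ancestral sampling."""
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = rng.standard_normal((num_samples, 7))  # start from pure noise
    for t in range(num_steps - 1, -1, -1):
        eps_hat = toy_denoiser(x, t, text_emb, alpha_bars)
        # Standard DDPM posterior mean computed from the predicted noise.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        # No noise is added at the final step.
        noise = rng.standard_normal(x.shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    # Project the rotation part back onto unit quaternions.
    x[:, 3:] /= np.linalg.norm(x[:, 3:], axis=1, keepdims=True)
    return x

if __name__ == "__main__":
    # Hypothetical "embedding": target pose in the first 7 slots, padding after.
    emb = np.array([1.0, 2.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.5])
    print(sample_poses(emb, num_samples=4))
```

Because the toy denoiser is an exact oracle, every sample collapses onto the same pose at the final step. A learned predictor conditioned on an ambiguous description would instead be expected to leave the samples spread over several plausible poses, which is the distributional behavior the abstract describes.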

Taxonomy

Topics: Geographic Information Systems Studies

Methods: ALIGN · Contrastive Language-Image Pre-training