MVL-Loc: Leveraging Vision-Language Model for Generalizable Multi-Scene Camera Relocalization

Zhendong Xiao; Wu Wei; Shujie Ji; Shan Yang; Changhao Chen

arXiv:2507.04509 · cs.CV · July 8, 2025

TL;DR

MVL-Loc introduces a vision-language model-based framework for accurate, robust, and generalizable multi-scene camera relocalization across indoor and outdoor environments, utilizing multimodal data and natural language guidance.

Contribution

The authors present MVL-Loc as the first framework to leverage pretrained vision-language models and natural language directives for multi-scene camera relocalization, with the aim of improving generalization and semantic scene understanding.

Findings

01

Achieves state-of-the-art accuracy on 7Scenes and Cambridge Landmarks datasets.

02

Demonstrates robustness across diverse indoor and outdoor environments.

03

Improves pose estimation accuracy over traditional deep learning-based methods that regress camera pose within a single scene.
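
Pose accuracy in camera relocalization is commonly summarized as two numbers per image: a translation error (distance between predicted and true camera position) and a rotation error (angle between predicted and true orientation). This page does not state which metric the paper uses, so the following is only a minimal sketch of that common convention, assuming orientations are stored as quaternions in (w, x, y, z) order. The function names and the sample values are hypothetical and are not taken from the paper.

```python
import math


def translation_error(t_pred, t_true):
    """Euclidean distance between predicted and ground-truth camera positions."""
    return math.dist(t_pred, t_true)


def rotation_error_deg(q_pred, q_true):
    """Angle in degrees between two orientations given as quaternions (w, x, y, z)."""
    norm_p = math.sqrt(sum(c * c for c in q_pred))
    norm_t = math.sqrt(sum(c * c for c in q_true))
    dot = sum(a * b for a, b in zip(q_pred, q_true)) / (norm_p * norm_t)
    # q and -q encode the same rotation, so use the absolute value,
    # and clamp to 1.0 to keep acos inside its domain under rounding.
    dot = min(1.0, abs(dot))
    return math.degrees(2.0 * math.acos(dot))


# Hypothetical values for illustration only (not results from the paper).
t_err = translation_error((1.0, 2.0, 0.5), (1.0, 2.0, 0.0))

# A 90-degree rotation about the z-axis compared with the identity rotation.
half_angle = math.radians(90.0) / 2.0
q_pred = (math.cos(half_angle), 0.0, 0.0, math.sin(half_angle))
q_true = (1.0, 0.0, 0.0, 0.0)
r_err = rotation_error_deg(q_pred, q_true)

print(f"translation error: {t_err:.2f}, rotation error: {r_err:.1f} deg")
```

Benchmarks typically aggregate these per-image errors over a test sequence, for example by taking the median; the unit of the translation error follows the unit of the dataset's ground-truth poses.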

Abstract

Camera relocalization, a cornerstone capability of modern computer vision, accurately determines a camera's position and orientation (6-DoF) from images and is essential for applications in augmented reality (AR), mixed reality (MR), autonomous driving, delivery drones, and robotic navigation. Unlike traditional deep learning-based methods that regress camera pose from images in a single scene, which often lack generalization and robustness in diverse environments, we propose MVL-Loc, a novel end-to-end multi-scene 6-DoF camera relocalization framework. MVL-Loc leverages pretrained world knowledge from vision-language models (VLMs) and incorporates multimodal data to generalize across both indoor and outdoor settings. Furthermore, natural language is employed as a directive tool to guide the multi-scene learning process, facilitating semantic understanding of complex scenes and…
