KM-ViPE: Online Tightly Coupled Vision-Language-Geometry Fusion for Open-Vocabulary Semantic SLAM

Zaid Nasser; Mikhail Iumanov; Tianhao Li; Maxim Popov; Jaafar Mahmoud; Malik Mohrat; Ilya Obrubov; Ekaterina Derevyanka; Ivan Sosin; Sergey Kolyubin

arXiv:2512.01889 · cs.CV · December 2, 2025

TL;DR

KM-ViPE is a real-time, open-vocabulary SLAM system for monocular cameras in dynamic environments. It fuses visual, geometric, and language features and targets robotics and AR/VR.

Contribution

The paper introduces an online SLAM framework that tightly couples visual, geometric, and language features and requires neither depth sensors nor offline calibration.

Findings

1. Competitive with state-of-the-art SLAM methods
2. Handles dynamic scenes with moving objects
3. Operates in real time on uncalibrated monocular camera input

Abstract

We present KM-ViPE (Knowledge Mapping Video Pose Engine), a real-time open-vocabulary SLAM framework for uncalibrated monocular cameras in dynamic environments. Unlike systems that require depth sensors and offline calibration, KM-ViPE operates directly on raw RGB streams, making it well suited to ego-centric applications and to harvesting internet-scale video data for training. KM-ViPE tightly couples DINO visual features with geometric constraints through an adaptive robust kernel based on high-level features, which handles both moving objects and movable static objects (e.g., moving furniture in ego-centric views). The system performs simultaneous online localization and open-vocabulary semantic mapping by fusing geometric and deep visual features aligned with language embeddings. Our results are competitive with state-of-the-art approaches, while existing solutions either operate offline, need depth…
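
The abstract does not give the form of the adaptive robust kernel. As a minimal sketch only, assume a Huber kernel whose threshold is scaled by the cosine similarity between per-point feature descriptors (a stand-in for DINO features). Under that assumption, down-weighting geometric residuals where high-level features disagree could look like the code below. The function names and the parameters `base_delta` and `min_scale` are hypothetical and are not taken from the paper.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (N, D) feature arrays."""
    a_n = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b_n = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    return np.sum(a_n * b_n, axis=1)


def adaptive_huber_weights(residuals, feat_src, feat_tgt,
                           base_delta=1.0, min_scale=0.1):
    """Per-point IRLS weights from a Huber kernel with a feature-dependent threshold.

    Assumption (not from the paper): the Huber threshold shrinks linearly
    from base_delta to base_delta * min_scale as feature similarity drops
    from 1 to 0, so points whose appearance changed (likely moving or
    moved objects) contribute less to the geometric cost.
    """
    sim = np.clip(cosine_similarity(feat_src, feat_tgt), 0.0, 1.0)
    delta = base_delta * (min_scale + (1.0 - min_scale) * sim)
    r = np.linalg.norm(residuals, axis=1)
    # Standard Huber IRLS weight: 1 inside the threshold, delta / r outside.
    return np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))


# Two points with the same residual magnitude (0.5) but different
# feature agreement: identical features vs. orthogonal features.
residuals = np.array([[0.5, 0.0], [0.0, 0.5]])
feat_src = np.array([[1.0, 0.0], [1.0, 0.0]])
feat_tgt = np.array([[1.0, 0.0], [0.0, 1.0]])
weights = adaptive_huber_weights(residuals, feat_src, feat_tgt)
print(weights)
```

In this toy case the point with matching features keeps full weight, while the point with mismatched features is down-weighted. A real system would apply such weights inside the pose and structure optimization rather than to isolated residuals.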

Taxonomy

Topics: Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Advanced Vision and Imaging