OVI-MAP: Open-Vocabulary Instance-Semantic Mapping

Zilong Deng; Federico Tombari; Marc Pollefeys; Johanna Wald; Daniel Barath

arXiv:2603.26541 · cs.CV · March 30, 2026

TL;DR

OVI-MAP is a real-time open-vocabulary 3D mapping system that decouples instance reconstruction from semantic inference, enabling stable, zero-shot semantic labeling in complex environments.

Contribution

The paper introduces a decoupled approach: it constructs class-agnostic 3D instance maps from RGB-D input and uses vision-language models for flexible semantic inference.

Findings

1. Outperforms state-of-the-art open-vocabulary mapping methods.
2. Operates in real time during online exploration.
3. Enables stable instance tracking and zero-shot semantic labeling.

Abstract

Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on a closed-set assumption or on dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, which decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map that is incrementally constructed from RGB-D input, while semantic features are extracted only from a small set of automatically selected views using vision-language models. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms…
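
To make the decoupling concrete, here is a minimal Python sketch of the general idea, not the authors' implementation. Each instance is tracked without a class label, a few views are chosen per instance, and a label is assigned by comparing the averaged view feature against text embeddings. Everything named here is an assumption for illustration: the `Instance` class, the rule of picking the views with the most visible pixels, and the three-dimensional toy vectors that stand in for real vision-language model embeddings.

```python
import math
from dataclasses import dataclass, field


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


@dataclass
class Instance:
    """Class-agnostic instance: it carries an id and observations, but no label."""
    instance_id: int
    # (visible_pixels, feature) pairs; the features are toy stand-ins
    # for image embeddings produced by a vision-language model.
    observations: list = field(default_factory=list)

    def add_observation(self, visible_pixels, feature):
        self.observations.append((visible_pixels, feature))

    def semantic_feature(self, k=2):
        """Average the features of the k views with the most visible pixels.

        The top-k-by-visibility rule is a hypothetical selection criterion.
        """
        if not self.observations:
            raise ValueError("instance has no observations")
        best = sorted(self.observations, key=lambda o: o[0], reverse=True)[:k]
        unit = [normalize(f) for _, f in best]
        dim = len(unit[0])
        mean = [sum(u[i] for u in unit) / len(unit) for i in range(dim)]
        return normalize(mean)


def label_instance(instance, text_embeddings, k=2):
    """Zero-shot label: the query whose text embedding is most similar."""
    feat = instance.semantic_feature(k)
    return max(text_embeddings, key=lambda name: cosine(feat, text_embeddings[name]))


# Toy text embeddings for an open vocabulary chosen at query time.
text_embeddings = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}

inst = Instance(instance_id=0)
inst.add_observation(5000, [0.9, 0.1, 0.0])
inst.add_observation(3000, [0.8, 0.2, 0.1])
inst.add_observation(50, [0.0, 0.1, 0.9])  # tiny view, dropped by top-k selection

label = label_instance(inst, text_embeddings)
print(label)
```

Because the label is computed from stored per-instance features rather than fused into the geometry, the vocabulary in `text_embeddings` can change at query time without rebuilding the instance map.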
