OVI-MAP: Open-Vocabulary Instance-Semantic Mapping
Zilong Deng, Federico Tombari, Marc Pollefeys, Johanna Wald, Daniel Barath

TL;DR
OVI-MAP is a real-time open-vocabulary 3D mapping system that decouples instance reconstruction from semantic inference, enabling stable instance tracking and zero-shot semantic labeling in complex environments.
Contribution
It introduces a novel decoupled approach: a class-agnostic 3D instance map is built incrementally from RGB-D input, and vision-language models, applied only to a small set of selected views, provide open-vocabulary semantic inference.
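Because the map itself is class-agnostic, labels can be assigned at query time by comparing each instance's vision-language embedding with text embeddings of whatever categories the user asks for. The sketch below assumes that each instance already carries one aggregated embedding and that labeling is a nearest-text-prompt match by cosine similarity. This is a common zero-shot recipe for CLIP-style models, not necessarily OVI-MAP's exact rule. The names `label_instances` and `cosine` are hypothetical, and the toy 3-D vectors stand in for real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def label_instances(
    instance_feats: Dict[int, List[float]],
    text_feats: Dict[str, List[float]],
) -> Dict[int, str]:
    """Assign each class-agnostic instance the query label with the most similar text embedding.

    Hypothetical zero-shot rule: argmax of cosine similarity over the query vocabulary.
    """
    labels: Dict[int, str] = {}
    for inst_id, feat in instance_feats.items():
        # The vocabulary is supplied at query time, so the map never stores class IDs.
        labels[inst_id] = max(text_feats, key=lambda name: cosine(feat, text_feats[name]))
    return labels


if __name__ == "__main__":
    instances = {0: [0.9, 0.1, 0.0], 1: [0.0, 0.2, 0.95]}
    prompts = {"chair": [1.0, 0.0, 0.0], "lamp": [0.0, 0.0, 1.0]}
    print(label_instances(instances, prompts))  # {0: 'chair', 1: 'lamp'}
```

The design point this illustrates is that changing or extending the vocabulary only requires new text embeddings; the instance map does not need to be rebuilt.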
Findings
Outperforms state-of-the-art open-vocabulary mapping methods.
Operates in real time during online exploration.
Enables stable instance tracking and zero-shot semantic labeling.
Abstract
Incremental open-vocabulary 3D instance-semantic mapping is essential for autonomous agents operating in complex everyday environments. However, it remains challenging due to the need for robust instance segmentation, real-time processing, and flexible open-set reasoning. Existing methods often rely on a closed-set assumption or on dense per-pixel language fusion, which limits scalability and temporal consistency. We introduce OVI-MAP, a system that decouples instance reconstruction from semantic inference. We propose to build a class-agnostic 3D instance map incrementally from RGB-D input, while semantic features are extracted by vision-language models only from a small set of automatically selected views. This design enables stable instance tracking and zero-shot semantic labeling throughout online exploration. Our system operates in real time and outperforms…
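Running a vision-language model only on "a small set of automatically selected views" bounds semantic cost per instance instead of per frame. The selection criterion is not stated in this summary. The minimal sketch below assumes a hypothetical score (the number of pixels an instance covers in a frame) and a fixed per-instance budget `k`, and keeps the top-`k` frames per instance with a min-heap as frames stream in. `TopKViewSelector`, `observe`, and `selected_views` are illustrative names, not OVI-MAP's API.

```python
import heapq
from collections import defaultdict
from typing import Dict, List, Tuple


class TopKViewSelector:
    """Keeps, for each instance ID, the k frames with the highest (hypothetical) view score."""

    def __init__(self, k: int = 3) -> None:
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        # Per-instance min-heap of (score, frame_id); the root is the weakest view kept so far.
        self._heaps: Dict[int, List[Tuple[float, int]]] = defaultdict(list)

    def observe(self, frame_id: int, visible_pixels: Dict[int, int]) -> None:
        """Update the selection with one frame's per-instance visible-pixel counts (assumed score)."""
        for inst_id, score in visible_pixels.items():
            heap = self._heaps[inst_id]
            entry = (float(score), frame_id)
            if len(heap) < self.k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                # Evict the weakest kept view; ties on score favor the later frame.
                heapq.heapreplace(heap, entry)

    def selected_views(self, inst_id: int) -> List[int]:
        """Frame IDs that would be sent to the VLM for this instance, best first."""
        return [frame for _, frame in sorted(self._heaps.get(inst_id, []), reverse=True)]


if __name__ == "__main__":
    sel = TopKViewSelector(k=2)
    sel.observe(0, {7: 120, 8: 40})
    sel.observe(1, {7: 300})
    sel.observe(2, {7: 80, 8: 500})
    print(sel.selected_views(7))  # [1, 0]
    print(sel.selected_views(8))  # [2, 0]
```

Each update costs O(log k) per visible instance, and the number of VLM calls per instance never exceeds `k`, however many frames observe it.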
