TL;DR
This paper introduces OVO, an online open-vocabulary 3D semantic mapping system that describes tracked 3D segments with CLIP vectors and integrates with SLAM backbones, achieving better segmentation metrics and a lower computational and memory footprint than offline baselines.
Contribution
The paper presents an online 3D semantic mapping pipeline that describes 3D segments with CLIP vectors fused by a new merging method. It reports better segmentation than both offline and online baselines, with a lower computational and memory footprint than offline ones.
Findings
Lower computational and memory footprint compared to offline baselines.
Superior segmentation metrics over offline and online methods.
Successful integration with SLAM backbones for end-to-end online mapping.
Abstract
This paper presents an Open-Vocabulary Online 3D semantic mapping pipeline, which we denote by its acronym OVO. Given a sequence of posed RGB-D frames, we detect and track 3D segments, which we describe using CLIP vectors. These vectors are computed by a novel CLIP merging method from the viewpoints in which each segment is observed. Notably, OVO has a significantly lower computational and memory footprint than offline baselines, while also showing better segmentation metrics than both offline and online ones. Along with superior segmentation performance, we also show experimental results of our mapping contributions integrated with two different full SLAM backbones (Gaussian-SLAM and ORB-SLAM2), being the first to use a neural network to merge CLIP descriptors and to demonstrate end-to-end open-vocabulary online 3D mapping with loop closure.
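To make the idea of per-segment CLIP descriptors concrete, here is a minimal sketch of fusing several per-view descriptors of one 3D segment and then labelling the segment by cosine similarity against text embeddings. It rests on stated assumptions: the abstract says OVO merges descriptors with a neural network, whereas this sketch substitutes a plain weighted mean; the function names, the weights, and the toy 3-dimensional vectors are all hypothetical and stand in for real CLIP embeddings.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def merge_descriptors(view_descriptors, weights=None):
    """Fuse the per-view descriptors of one 3D segment into a unit vector.

    Assumption: a weighted mean is used here as a simple stand-in for the
    learned merging described in the abstract.
    """
    if weights is None:
        weights = [1.0] * len(view_descriptors)
    dim = len(view_descriptors[0])
    merged = [0.0] * dim
    for desc, w in zip(view_descriptors, weights):
        unit = normalize(desc)
        for i in range(dim):
            merged[i] += w * unit[i]
    return normalize(merged)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def classify_segment(segment_descriptor, text_embeddings):
    """Return the label whose text embedding is most similar to the segment."""
    return max(
        text_embeddings,
        key=lambda label: cosine(segment_descriptor, text_embeddings[label]),
    )


# Toy stand-ins for CLIP image embeddings of one segment seen from three views.
views = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [1.0, 0.0, 0.2]]
# Hypothetical per-view weights, e.g. reflecting how well each view sees the segment.
weights = [0.5, 0.2, 0.3]
segment = merge_descriptors(views, weights)

# Toy stand-ins for CLIP text embeddings of the open-vocabulary queries.
queries = {
    "chair": [1.0, 0.1, 0.1],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
label = classify_segment(segment, queries)
```

Because the class set is just a dictionary of text embeddings, new query labels can be added at query time without recomputing the stored segment descriptors, which is what makes this kind of map open-vocabulary.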
Taxonomy
Methods: Sparse Evolutionary Training · Contrastive Language-Image Pre-training
