Real-Time 3D Vision-Language Embedding Mapping
Christian Rauch, Björn Ellensohn, Linus Nwankwo, Vedant Dave, Elmar Rueckert

TL;DR
This paper introduces a real-time method for integrating vision-language embeddings into 3D representations, enabling accurate, task-agnostic semantic mapping for robotic applications.
Contribution
It presents a novel approach combining local embedding masking and confidence-weighted 3D integration for real-time, metric-accurate semantic 3D mapping using vision-language models.
Findings
Achieves more accurate object localization in real-world sequences
Improves runtime performance for real-time applications
Demonstrates versatility across robotic tasks
Abstract
A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple yet powerful way to integrate the 2D embeddings of a Vision-Language Model into a metric-accurate 3D representation in real time. We combine a local embedding masking strategy, for a more distinct embedding distribution, with a confidence-weighted 3D integration for more reliable 3D embeddings. The resulting metric-accurate embedding representation is task-agnostic and can represent semantic concepts at a global multi-room level as well as at a local object level. This enables a variety of interactive robotic applications that require the localisation of objects-of-interest via natural language. We evaluate our approach on a variety of real-world sequences and demonstrate that these strategies achieve a more accurate object-of-interest localisation while improving the runtime…
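To make the idea of confidence-weighted 3D integration more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each voxel stores a running weighted mean of the embeddings projected into it, and that a natural-language query (already encoded as an embedding) is localised by cosine similarity. The names `VoxelEmbedding`, `integrate`, and `localise`, the dictionary-based voxel map, and the scalar confidence values are all hypothetical; the abstract does not specify how confidences are computed or how the map is stored.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class VoxelEmbedding:
    """Hypothetical per-voxel state: a confidence-weighted running mean."""

    mean: list[float]
    weight: float = 0.0  # accumulated confidence of all observations

    def integrate(self, embedding: list[float], confidence: float) -> None:
        """Fuse one observed embedding, weighted by its confidence."""
        if confidence <= 0.0:
            return  # ignore observations that carry no confidence
        total = self.weight + confidence
        self.mean = [
            (self.weight * m + confidence * e) / total
            for m, e in zip(self.mean, embedding)
        ]
        self.weight = total


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def localise(
    voxel_map: dict[tuple[int, int, int], VoxelEmbedding],
    query: list[float],
) -> tuple[int, int, int]:
    """Return the voxel index whose fused embedding best matches the query."""
    return max(
        voxel_map,
        key=lambda idx: cosine_similarity(voxel_map[idx].mean, query),
    )


# Toy example with 2D embeddings (real VLM embeddings are much larger).
voxel_map = {
    (0, 0, 0): VoxelEmbedding(mean=[0.0, 0.0]),
    (1, 0, 0): VoxelEmbedding(mean=[0.0, 0.0]),
}
voxel_map[(0, 0, 0)].integrate([1.0, 0.0], confidence=0.9)
voxel_map[(0, 0, 0)].integrate([0.0, 1.0], confidence=0.1)  # weak outlier
voxel_map[(1, 0, 0)].integrate([0.0, 1.0], confidence=0.8)

best_voxel = localise(voxel_map, query=[1.0, 0.0])
```

In this toy setup the low-confidence outlier shifts the fused embedding of voxel `(0, 0, 0)` only slightly, so that voxel remains the best match for the query. A weighted running mean is just one plausible fusion rule; the paper may use a different one.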
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
