Real-Time 3D Vision-Language Embedding Mapping

Christian Rauch; Björn Ellensohn; Linus Nwankwo; Vedant Dave; Elmar Rueckert

arXiv:2508.06291 · cs.RO · August 11, 2025

TL;DR

This paper introduces a real-time method for integrating vision-language embeddings into 3D representations, enabling accurate, task-agnostic semantic mapping for robotic applications.

Contribution

It presents a novel approach combining local embedding masking and confidence-weighted 3D integration for real-time, metric-accurate semantic 3D mapping using vision-language models.

Findings

01

Achieves more accurate object localization in real-world sequences

02

Improves runtime performance for real-time applications

03

Demonstrates versatility across robotic tasks

Abstract

A metric-accurate semantic 3D representation is essential for many robotic tasks. This work proposes a simple yet powerful way to integrate the 2D embeddings of a Vision-Language Model into a metric-accurate 3D representation in real time. We combine a local embedding masking strategy, for a more distinct embedding distribution, with a confidence-weighted 3D integration for more reliable 3D embeddings. The resulting metric-accurate embedding representation is task-agnostic and can represent semantic concepts at a global, multi-room level as well as at a local, object level. This enables a variety of interactive robotic applications that require the localisation of objects-of-interest via natural language. We evaluate our approach on a variety of real-world sequences and demonstrate that these strategies achieve a more accurate object-of-interest localisation while improving the runtime…
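
The abstract does not spell out how the confidence-weighted 3D integration or the language-based localisation are computed. As a rough illustration only, and not the authors' implementation, the sketch below assumes a sparse voxel grid in which each voxel keeps a confidence-weighted running mean of the unit-normalised 2D embeddings projected into it, and assumes that a text query is matched by cosine similarity. The class `EmbeddingVoxelMap`, its methods, the `voxel_size` parameter and the way a per-observation `confidence` is obtained are all hypothetical.

```python
import math


def normalise(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


class EmbeddingVoxelMap:
    """Hypothetical sparse voxel map storing one fused embedding per voxel."""

    def __init__(self, dim, voxel_size=0.05):
        self.dim = dim
        self.voxel_size = voxel_size  # edge length in metres (assumed value)
        self.sums = {}     # voxel key -> sum of confidence * embedding
        self.weights = {}  # voxel key -> sum of confidences

    def key(self, point):
        """Map a metric 3D point to its integer voxel index."""
        return tuple(math.floor(c / self.voxel_size) for c in point)

    def integrate(self, point, embedding, confidence):
        """Fuse one observation into the voxel containing `point`."""
        if confidence <= 0:
            return  # assumption: non-positive confidence means "ignore"
        k = self.key(point)
        e = normalise(embedding)
        s = self.sums.setdefault(k, [0.0] * self.dim)
        for i in range(self.dim):
            s[i] += confidence * e[i]
        self.weights[k] = self.weights.get(k, 0.0) + confidence

    def embedding(self, k):
        """Confidence-weighted mean embedding of voxel `k`."""
        w = self.weights[k]
        return [x / w for x in self.sums[k]]

    def query(self, text_embedding):
        """Return (voxel key, cosine similarity) of the best-matching voxel."""
        q = normalise(text_embedding)
        best_key, best_score = None, -math.inf
        for k in self.weights:
            e = normalise(self.embedding(k))
            score = sum(a * b for a, b in zip(q, e))
            if score > best_score:
                best_key, best_score = k, score
        return best_key, best_score
```

In this sketch, a caller would invoke `integrate(point, embedding, confidence)` for every back-projected pixel of a frame and then call `query(text_embedding)` with the text embedding of a phrase; multiplying the returned voxel key by `voxel_size` gives an approximate metric position. A weighted running mean is one common fusion choice; the paper may use a different scheme.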

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization