Multimodal Spatial Language Maps for Robot Navigation and Manipulation

Chenguang Huang; Oier Mees; Andy Zeng; Wolfram Burgard

arXiv:2506.06862 · cs.RO · June 10, 2025

Open Access

TL;DR

This paper introduces multimodal spatial language maps that fuse visual, audio, and language data with 3D environment reconstructions, enabling robots to understand and execute complex spatial goals in navigation and manipulation tasks.

Contribution

The paper presents a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment, enabling zero-shot goal localization and map sharing across robot embodiments for navigation and manipulation.

Findings

01 Enables zero-shot spatial goal navigation in simulation and in the real world

02 Improves goal disambiguation by 50% in ambiguous environments

03 Supports navigation and interaction using multimodal cues across robot types

Abstract

Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals…
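
The core idea in the abstract, attaching pretrained features to map locations and then matching a query embedding against them, can be illustrated with a toy sketch. This is not the authors' implementation: it uses a 2D grid instead of a 3D reconstruction, hand-written 3-dimensional vectors instead of features from a pretrained model, and the names `FeatureGridMap`, `add_observation`, and `localize` are hypothetical.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class FeatureGridMap:
    """Hypothetical 2D grid that averages feature vectors per cell."""

    def __init__(self, cell_size, dim):
        self.cell_size = cell_size
        self.dim = dim
        self._sums = defaultdict(lambda: [0.0] * dim)
        self._counts = defaultdict(int)

    def _cell(self, x, y):
        # Map a metric position to integer grid indices.
        return (math.floor(x / self.cell_size), math.floor(y / self.cell_size))

    def add_observation(self, x, y, feature):
        # Accumulate the feature observed at (x, y) into its grid cell.
        cell = self._cell(x, y)
        acc = self._sums[cell]
        for i, value in enumerate(feature):
            acc[i] += value
        self._counts[cell] += 1

    def cell_feature(self, cell):
        # Mean feature of an already-observed cell.
        n = self._counts[cell]
        return [value / n for value in self._sums[cell]]

    def localize(self, query):
        # Return the center of the observed cell most similar to the query.
        if not self._counts:
            raise ValueError("map is empty")
        best = max(self._counts, key=lambda c: cosine(self.cell_feature(c), query))
        return ((best[0] + 0.5) * self.cell_size, (best[1] + 0.5) * self.cell_size)


# Toy embeddings (assumption): axis 0 ~ "sofa", axis 1 ~ "sink".
grid = FeatureGridMap(cell_size=0.5, dim=3)
grid.add_observation(0.2, 0.3, [0.9, 0.1, 0.0])
grid.add_observation(0.4, 0.1, [1.0, 0.0, 0.1])
grid.add_observation(2.6, 1.2, [0.1, 0.9, 0.0])

sofa_goal = grid.localize([1.0, 0.0, 0.0])
sink_goal = grid.localize([0.0, 1.0, 0.0])
print(sofa_goal, sink_goal)
```

In a real system the query vector would come from the text (or audio) encoder of the same pretrained model that produced the stored features, so that language and map features share one embedding space; the sketch only shows the lookup step under that assumption.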

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI