Multimodal Spatial Language Maps for Robot Navigation and Manipulation
Chenguang Huang, Oier Mees, Andy Zeng, Wolfram Burgard

TL;DR
This paper introduces multimodal spatial language maps that fuse visual, audio, and language data with 3D environment reconstructions, enabling robots to understand and execute complex spatial goals in navigation and manipulation tasks.
Contribution
It presents a novel spatial map representation integrating pretrained multimodal features, enabling zero-shot goal localization and cross-embodiment sharing for robotic navigation and manipulation.
Findings
Enables zero-shot spatial goal navigation in simulation and in the real world
Improves goal disambiguation by 50% in ambiguous environments
Supports navigation and interaction using multimodal cues across robot types
Abstract
Grounding language to a navigating agent's observations can leverage pretrained multimodal foundation models to match perceptions to object or event descriptions. However, previous approaches remain disconnected from environment mapping, lack the spatial precision of geometric maps, or neglect additional modality information beyond vision. To address this, we propose multimodal spatial language maps as a spatial map representation that fuses pretrained multimodal features with a 3D reconstruction of the environment. We build these maps autonomously using standard exploration. We present two instances of our maps, which are visual-language maps (VLMaps) and their extension to audio-visual-language maps (AVLMaps) obtained by adding audio information. When combined with large language models (LLMs), VLMaps can (i) translate natural language commands into open-vocabulary spatial goals…
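To make the idea of indexing a goal in a feature-bearing map concrete, here is a minimal sketch of open-vocabulary goal localization by cosine similarity. It is not the authors' implementation. It assumes a 2D grid rather than a full 3D reconstruction, uses random toy vectors in place of features from a pretrained visual-language encoder, and the name `localize_goal` is hypothetical.

```python
import numpy as np


def localize_goal(feature_map, text_embedding):
    """Find the grid cell whose feature best matches a query embedding.

    feature_map: (H, W, D) array holding one D-dimensional feature per cell
        (assumed to come from a pretrained encoder; toy data is used below).
    text_embedding: (D,) array in the same embedding space as the map.
    Returns ((row, col), score_map), where score_map has shape (H, W).
    """
    h, w, d = feature_map.shape
    cells = feature_map.reshape(-1, d)

    # Normalize both sides so the dot product equals cosine similarity.
    cells_n = cells / (np.linalg.norm(cells, axis=1, keepdims=True) + 1e-8)
    query_n = text_embedding / (np.linalg.norm(text_embedding) + 1e-8)

    scores = cells_n @ query_n
    best = int(np.argmax(scores))
    return divmod(best, w), scores.reshape(h, w)


# Toy data: random features with one cell aligned to the query direction.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(4, 5, 8))
query = rng.normal(size=8)
feature_map[2, 3] = 3.0 * query  # planted "goal" cell

goal_cell, score_map = localize_goal(feature_map, query)
print(goal_cell)
```

In a real system the per-cell features and the query embedding would come from the same pretrained multimodal model, and the selected cell would be converted to metric coordinates for a planner. Both steps are outside the scope of this sketch.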
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems · Social Robot Interaction and HRI
