OpenMap: Instruction Grounding via Open-Vocabulary Visual-Language Mapping

Danyang Li; Zenghui Yang; Guangpeng Qi; Songtao Pang; Guangyong Shang; Qiang Ma; Zheng Yang

arXiv:2508.01723 · cs.RO · August 5, 2025

TL;DR

OpenMap is a zero-shot visual-language mapping system that improves instruction grounding for navigation by combining structural-semantic constraints and large language model assistance, outperforming existing methods in complex 3D environments.

Contribution

The paper introduces OpenMap, a novel open-vocabulary mapping approach that enhances instruction grounding through structural-semantic constraints and LLM-assisted instance selection.

Findings

01 Outperforms state-of-the-art methods in zero-shot instruction grounding

02 Effective in 3D semantic mapping and retrieval tasks

03 Demonstrates robustness across diverse indoor environments

Abstract

Grounding natural language instructions to visual observations is fundamental for embodied agents operating in open-world environments. Recent advances in visual-language mapping have enabled generalizable semantic representations by leveraging vision-language models (VLMs). However, these methods often fall short in aligning free-form language commands with specific scene instances, due to limitations in both instance-level semantic consistency and instruction interpretation. We present OpenMap, a zero-shot open-vocabulary visual-language map designed for accurate instruction grounding in navigation tasks. To address semantic inconsistencies across views, we introduce a Structural-Semantic Consensus constraint that jointly considers global geometric structure and vision-language similarity to guide robust 3D instance-level aggregation. To improve instruction interpretation, we propose…
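The abstract describes the Structural-Semantic Consensus constraint only at a high level: geometric structure and vision-language similarity are considered jointly when aggregating observations into 3D instances. The sketch below is one illustrative way such a joint score could be formed, not the authors' formulation. The voxel-set IoU, the linear weighting `alpha`, the merge `threshold`, and every function name are assumptions introduced here for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors (e.g. VLM embeddings)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def voxel_iou(voxels_a, voxels_b):
    """Intersection-over-union of two sets of occupied voxel coordinates."""
    a, b = set(voxels_a), set(voxels_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def consensus_score(voxels_a, feat_a, voxels_b, feat_b, alpha=0.5):
    """Hypothetical joint score: weighted sum of geometric and semantic agreement.

    alpha is an assumed weighting, not a value taken from the paper.
    """
    geometric = voxel_iou(voxels_a, voxels_b)
    semantic = cosine_similarity(feat_a, feat_b)
    return alpha * geometric + (1.0 - alpha) * semantic


def should_merge(voxels_a, feat_a, voxels_b, feat_b, alpha=0.5, threshold=0.6):
    """Merge two candidate instances when the joint score reaches an assumed threshold."""
    return consensus_score(voxels_a, feat_a, voxels_b, feat_b, alpha) >= threshold


if __name__ == "__main__":
    # Two partial views of what may be the same object: overlapping voxels,
    # identical (toy) embeddings.
    view_a = {(0, 0, 0), (0, 0, 1)}
    view_b = {(0, 0, 1), (0, 0, 2)}
    print(should_merge(view_a, [1.0, 0.0], view_b, [1.0, 0.0]))
```

Under these assumptions, two fragments merge only when neither cue strongly disagrees: fragments that overlap spatially but have dissimilar embeddings, or that look alike but lie far apart, both score low.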
