MapGPT: Map-Guided Prompting with Adaptive Path Planning for Vision-and-Language Navigation
Jiaqi Chen, Bingqian Lin, Ran Xu, Zhenhua Chai, Xiaodan Liang, Kwan-Yee K. Wong

TL;DR
MapGPT introduces a novel map-guided prompting approach with adaptive path planning for vision-and-language navigation, significantly improving zero-shot performance and enabling global environment understanding in embodied agents.
Contribution
The paper presents a new map-guided GPT-based agent with online maps and adaptive planning, enhancing global exploration and path planning in VLN tasks.
Findings
Achieves state-of-the-art zero-shot performance on R2R and REVERIE, with success rate (SR) improvements of roughly 10% and 12%, respectively
Demonstrates emergent global thinking and path planning abilities in GPT-based agents
Applicable to both GPT-4 and GPT-4V models
Abstract
Embodied agents equipped with GPT as their brains have exhibited extraordinary decision-making and generalization abilities across various tasks. However, existing zero-shot agents for vision-and-language navigation (VLN) only prompt GPT-4 to select potential locations within localized environments, without constructing an effective "global-view" for the agent to understand the overall environment. In this work, we present a novel map-guided GPT-based agent, dubbed MapGPT, which introduces an online linguistic-formed map to encourage global exploration. Specifically, we build an online map and incorporate it into the prompts that include node information and topological relationships, to help GPT understand the spatial environment. Benefiting from this design, we further propose an adaptive planning mechanism to assist the agent in performing multi-step path planning based on a map,…
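The idea of an online map expressed in language can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class name `OnlineMap`, the methods `observe` and `to_prompt`, the node identifiers, and the exact wording of the rendered text are all assumptions made for illustration. The sketch only shows how node information and topological relationships could be accumulated step by step and then serialized into text for a prompt.

```python
from collections import defaultdict


class OnlineMap:
    """Hypothetical topological map that grows as the agent observes new nodes."""

    def __init__(self):
        self.neighbors = defaultdict(set)  # node id -> adjacent node ids
        self.visited = []                  # node ids in the order they were visited

    def observe(self, current, navigable):
        """Record a visit to `current` and the nodes reachable from it."""
        if current not in self.visited:
            self.visited.append(current)
        adjacent = self.neighbors[current]  # also registers isolated nodes
        for node in navigable:
            adjacent.add(node)
            self.neighbors[node].add(current)

    def to_prompt(self):
        """Render the map as plain text suitable for inclusion in a prompt."""
        lines = ["Map:"]
        for node in sorted(self.neighbors):
            adjacent = ", ".join(sorted(self.neighbors[node]))
            lines.append(f"Place {node} is connected with Places {adjacent}")
        unvisited = sorted(set(self.neighbors) - set(self.visited))
        lines.append("Visited: " + ", ".join(self.visited))
        lines.append("Unexplored: " + (", ".join(unvisited) or "none"))
        return "\n".join(lines)


# Two navigation steps: the agent starts at node 0, then moves to node 1.
online_map = OnlineMap()
online_map.observe("0", ["1", "2"])
online_map.observe("1", ["3"])
prompt_text = online_map.to_prompt()
print(prompt_text)
```

Because the map is rebuilt into text at every step, the prompt always reflects everything observed so far, including places seen but not yet visited. That is what would let a language model consider returning to an earlier unexplored node instead of choosing only among the immediate neighbors.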
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
Methods: Label Smoothing · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Transformer · GPT-4 · Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout
