MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains
Zhaohuan Zhan, Lisha Yu, Sijie Yu, Guang Tan

TL;DR
This paper introduces MC-GPT, a novel approach that combines memory maps and reasoning chains to improve vision-and-language navigation, making it more effective and interpretable by leveraging LLMs and navigation history.
Contribution
The paper proposes a topological memory map and a navigation chain of thoughts module to enhance navigation strategies and interpretability in VLN tasks using LLMs.
Findings
Improved navigation accuracy on REVERIE and R2R datasets.
Enhanced interpretability of navigation reasoning.
Effective integration of memory and strategy modules.
Abstract
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction. While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability. Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities. However, existing LLM-based methods face limitations in memory construction and diversity of navigation strategies. To address these challenges, we propose a suite of techniques. Firstly, we introduce a method to maintain a topological map that stores navigation history, retaining information about viewpoints, objects, and their spatial relationships. This map also serves as a global action space. Additionally, we present a Navigation Chain of Thoughts module, leveraging human…
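The topological map described in the abstract can be pictured with a small sketch. This is not the authors' implementation: the class name `TopoMap`, its methods, and the viewpoint labels are hypothetical. It also assumes that the "global action space" means every viewpoint that has been observed but not yet visited, which the abstract does not spell out.

```python
from collections import defaultdict, deque


class TopoMap:
    """Hypothetical topological memory: viewpoints, their objects, and adjacency."""

    def __init__(self):
        self.edges = defaultdict(set)  # viewpoint -> adjacent viewpoints
        self.objects = {}              # visited viewpoint -> objects seen there
        self.visited = []              # visit order, i.e. the navigation history

    def update(self, viewpoint, seen_objects, neighbors):
        """Record one navigation step: where the agent is and what it observes."""
        if viewpoint not in self.visited:
            self.visited.append(viewpoint)
        self.objects[viewpoint] = list(seen_objects)
        for n in neighbors:
            # Store connectivity in both directions so the agent can backtrack.
            self.edges[viewpoint].add(n)
            self.edges[n].add(viewpoint)

    def global_actions(self):
        """Assumed global action space: observed but unvisited viewpoints."""
        return sorted(n for n in self.edges if n not in self.visited)

    def path(self, start, goal):
        """Breadth-first search over the map; returns a list of viewpoints or None."""
        prev = {start: None}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if node == goal:
                route = []
                while node is not None:
                    route.append(node)
                    node = prev[node]
                return route[::-1]
            for n in self.edges.get(node, ()):
                if n not in prev:
                    prev[n] = node
                    queue.append(n)
        return None


topo = TopoMap()
topo.update("A", ["sofa"], ["B", "C"])
topo.update("B", ["sink"], ["A", "D"])
candidates = topo.global_actions()
route = topo.path("B", "C")
```

Under these assumptions, an agent standing at `B` can choose the distant candidate `C` and reach it by backtracking through `A`, which is the kind of non-local move a purely local action space would not offer.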
Taxonomy
Topics: Robotics and Automated Systems · Constraint Satisfaction and Optimization · AI-based Problem Solving and Planning
