Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments
Zhiyuan Li, Yanfeng Lu, Yao Mu, Hong Qiao

TL;DR
Cog-GA introduces a large language model-based generative agent for vision-language navigation in continuous 3D environments, employing cognitive mapping, waypoint prediction, and reflective mechanisms to enhance navigation efficiency and interpretability.
Contribution
This paper presents Cog-GA, a novel LLM-based agent that constructs cognitive maps and predicts waypoints, advancing the capabilities of VLN-CE agents with human-like reasoning.
Findings
Achieves state-of-the-art performance on VLN-CE benchmarks.
Demonstrates effective spatial reasoning and navigation efficiency.
Shows improved interpretability through dual-channel scene descriptions.
Abstract
Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI, demanding agents to navigate freely in unbounded 3D spaces solely guided by natural language instructions. This task introduces distinct challenges in multimodal comprehension, spatial reasoning, and decision-making. To address these challenges, we introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. Firstly, it constructs a cognitive map, integrating temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Secondly, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel…
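As an illustration of the cognitive map the abstract describes, the following is a minimal sketch of a structure that stores temporal, spatial, and semantic information per visited location and serializes it into text an LLM could read. It is not the authors' implementation: the names `MapNode`, `CognitiveMap`, `add_observation`, and `to_prompt`, the 2D position format, and the rule that links consecutive observations are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class MapNode:
    """One visited location (hypothetical layout, not from the paper)."""
    step: int                      # temporal element: when the node was visited
    position: Tuple[float, float]  # spatial element: assumed (x, y) coordinates
    labels: List[str]              # semantic element: objects seen at the node


@dataclass
class CognitiveMap:
    """Graph of visited locations with undirected edges between them."""
    nodes: List[MapNode] = field(default_factory=list)
    edges: Dict[int, List[int]] = field(default_factory=dict)

    def add_observation(self, position: Tuple[float, float],
                        labels: List[str]) -> int:
        """Append a node and link it to the previously visited node."""
        node_id = len(self.nodes)
        self.nodes.append(MapNode(step=node_id, position=position,
                                  labels=list(labels)))
        self.edges[node_id] = []
        if node_id > 0:
            # Assumption: the agent moved directly from the previous node.
            self.edges[node_id - 1].append(node_id)
            self.edges[node_id].append(node_id - 1)
        return node_id

    def to_prompt(self) -> str:
        """Serialize the map as plain text, one line per node."""
        lines = []
        for node_id, node in enumerate(self.nodes):
            x, y = node.position
            seen = ", ".join(node.labels)
            lines.append(
                f"t={node.step} node={node_id} pos=({x:.1f}, {y:.1f}) sees: {seen}"
            )
        return "\n".join(lines)
```

In this sketch, each call to `add_observation` would follow a move to a new waypoint, and the string from `to_prompt` would be placed in the LLM's context as its spatial memory. How Cog-GA actually encodes the map and selects waypoints is not specified in the text above.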
Taxonomy
Topics: Multimodal Machine Learning Applications · Speech and dialogue systems
