Cog-GA: A Large Language Models-based Generative Agent for Vision-Language Navigation in Continuous Environments

Zhiyuan Li; Yanfeng Lu; Yao Mu; Hong Qiao

arXiv:2409.02522 · cs.AI · September 24, 2024

TL;DR

Cog-GA introduces a large language model-based generative agent for vision-language navigation in continuous 3D environments, employing cognitive mapping, waypoint prediction, and reflective mechanisms to enhance navigation efficiency and interpretability.

Contribution

This paper presents Cog-GA, a novel LLM-based agent that constructs cognitive maps and predicts waypoints, advancing the capabilities of VLN-CE agents with human-like reasoning.

Findings

01. Achieves state-of-the-art performance on VLN-CE benchmarks.

02. Demonstrates effective spatial reasoning and navigation efficiency.

03. Shows improved interpretability through dual-channel scene descriptions.

Abstract

Vision Language Navigation in Continuous Environments (VLN-CE) represents a frontier in embodied AI, demanding agents to navigate freely in unbounded 3D spaces solely guided by natural language instructions. This task introduces distinct challenges in multimodal comprehension, spatial reasoning, and decision-making. To address these challenges, we introduce Cog-GA, a generative agent founded on large language models (LLMs) tailored for VLN-CE tasks. Cog-GA employs a dual-pronged strategy to emulate human-like cognitive processes. Firstly, it constructs a cognitive map, integrating temporal, spatial, and semantic elements, thereby facilitating the development of spatial memory within LLMs. Secondly, Cog-GA employs a predictive mechanism for waypoints, strategically optimizing the exploration trajectory to maximize navigational efficiency. Each waypoint is accompanied by a dual-channel…
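
The abstract describes the cognitive map only as integrating temporal, spatial, and semantic elements. The sketch below shows one way such a structure could be laid out and serialized as text for an LLM. It is a minimal illustration under stated assumptions, not the paper's implementation: the names MapNode, CognitiveMap, add_node, and to_prompt, the 2D coordinates, and the text format are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple


@dataclass
class MapNode:
    """One waypoint in the map (hypothetical structure, not from the paper)."""
    step: int                      # temporal element: order in which the node was added
    position: Tuple[float, float]  # spatial element: assumed planar coordinates
    description: str               # semantic element: free-text scene description


@dataclass
class CognitiveMap:
    """Graph of waypoints that can be rendered as text for a language model."""
    nodes: List[MapNode] = field(default_factory=list)
    edges: Set[Tuple[int, int]] = field(default_factory=set)

    def add_node(self, position: Tuple[float, float], description: str,
                 parent: Optional[int] = None) -> int:
        """Append a node, optionally linking it to the node it was reached from."""
        index = len(self.nodes)
        self.nodes.append(MapNode(step=index, position=position, description=description))
        if parent is not None:
            self.edges.add((parent, index))
        return index

    def to_prompt(self) -> str:
        """Serialize nodes, then edges, as plain text lines."""
        lines = [f"[t={n.step}] at {n.position}: {n.description}" for n in self.nodes]
        lines += [f"{a} -> {b}" for a, b in sorted(self.edges)]
        return "\n".join(lines)


# Usage: record two waypoints and render the map as text.
cog_map = CognitiveMap()
start = cog_map.add_node((0.0, 0.0), "hallway with a wooden door")
cog_map.add_node((1.5, 0.0), "kitchen entrance", parent=start)
prompt_text = cog_map.to_prompt()
print(prompt_text)
```

Keeping the map as plain text is one simple way to give an LLM access to spatial memory through its prompt; the representation actually used by Cog-GA may differ.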

Taxonomy

Topics: Multimodal Machine Learning Applications · Speech and dialogue systems