CA-VLN: Collaborative Agents in MLLM-Powered Visual-Language Navigation
Ruolin Zhu, Shaobin Li, Zixing Zhu, Jing Jia, Min Yang

TL;DR
This paper introduces a new framework for visual-language navigation using collaborative agents powered by multimodal large language models to improve generalization in unseen environments.
Contribution
A dual-agent framework that pairs a Knowledge Agent (semantic context and commonsense reasoning) with a Hierarchical History Agent (episodic memory) to improve navigation generalization.
Findings
The proposed CA-VLN framework achieves state-of-the-art performance on R2R, REVERIE, and SOON datasets.
The model significantly improves generalization and navigation success in previously unobserved environments.
Abstract
Generalization to unseen environments remains a fundamental challenge in Vision-Language Navigation. To tackle this issue, we propose a novel framework that leverages world knowledge embedded within Multimodal Large Language Models. We introduce Collaborative Agents in Visual-Language Navigation (CA-VLN), a framework based on a dual-agent architecture. This architecture comprises a Knowledge Agent, which infuses the action prediction process with semantic context and commonsense reasoning, and a Hierarchical History Agent, which constructs a detailed episodic memory to enable long-horizon planning. The collaboration between these agents facilitates a dynamic interplay between high-level semantic understanding and grounded episodic experience. Extensive experiments on the R2R, REVERIE and SOON datasets demonstrate that our model achieves state-of-the-art performance, significantly…
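The abstract does not specify how the two agents are implemented or how their outputs are combined. The following is a minimal illustrative sketch, not the authors' method. It assumes a toy setting in which each candidate action has a text description. A keyword-overlap scorer stands in for the MLLM-backed Knowledge Agent, and a two-level list memory stands in for the Hierarchical History Agent. All names (`KnowledgeAgent.score`, `HierarchicalHistoryAgent.penalty`, `choose_action`, `history_weight`, `segment_len`) are hypothetical.

```python
from dataclasses import dataclass, field


class KnowledgeAgent:
    """Hypothetical stand-in for an MLLM: scores candidates by word overlap
    between the instruction and each candidate's text description."""

    def score(self, instruction: str, candidates: dict[str, str]) -> dict[str, float]:
        words = set(instruction.lower().split())
        return {
            action: float(len(words & set(desc.lower().split())))
            for action, desc in candidates.items()
        }


@dataclass
class HierarchicalHistoryAgent:
    """Hypothetical two-level episodic memory: individual steps plus
    coarse segments of `segment_len` consecutive steps."""

    segment_len: int = 2
    steps: list[str] = field(default_factory=list)
    segments: list[tuple[str, ...]] = field(default_factory=list)

    def record(self, action: str) -> None:
        self.steps.append(action)
        # Close a coarse segment every `segment_len` steps.
        if len(self.steps) % self.segment_len == 0:
            self.segments.append(tuple(self.steps[-self.segment_len:]))

    def penalty(self, candidates: dict[str, str]) -> dict[str, float]:
        # Penalize candidates that were already chosen in this episode.
        visited = set(self.steps)
        return {a: 1.0 if a in visited else 0.0 for a in candidates}


def choose_action(
    instruction: str,
    candidates: dict[str, str],
    knowledge: KnowledgeAgent,
    history: HierarchicalHistoryAgent,
    history_weight: float = 3.0,  # assumed weighting, not from the paper
) -> str:
    """Combine the semantic score with the revisit penalty and pick the best."""
    semantic = knowledge.score(instruction, candidates)
    revisit = history.penalty(candidates)
    combined = {a: semantic[a] - history_weight * revisit[a] for a in candidates}
    # Sorting first makes tie-breaking deterministic (alphabetical).
    return max(sorted(combined), key=combined.get)


if __name__ == "__main__":
    knowledge, history = KnowledgeAgent(), HierarchicalHistoryAgent()
    views = {"hall": "a long hallway", "kitchen": "a kitchen with a sink"}
    action = choose_action("go to the kitchen and stop at the sink", views, knowledge, history)
    history.record(action)
    print(action, history.steps)
```

The point of the sketch is the division of labor: one component contributes instruction-conditioned semantic evidence, the other contributes evidence from what has already happened in the episode, and the action choice depends on both.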
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
