CA-VLN: Collaborative Agents in MLLM-Powered Visual-Language Navigation

Ruolin Zhu; Shaobin Li; Zixing Zhu; Jing Jia; Min Yang

PMC · DOI: 10.3390/s26041254 · February 14, 2026

Open Access

TL;DR

This paper introduces CA-VLN, a visual-language navigation framework in which two collaborating agents, powered by multimodal large language models, improve generalization to unseen environments.

Contribution

A dual-agent framework that combines semantic reasoning (the Knowledge Agent) with episodic memory (the Hierarchical History Agent) to improve navigation generalization.

Findings

01 The proposed CA-VLN framework achieves state-of-the-art performance on the R2R, REVERIE, and SOON datasets.

02 The model significantly improves generalization and navigation success in previously unseen environments.

Abstract

Generalization to unseen environments remains a fundamental challenge in Vision-Language Navigation. To tackle this issue, we propose a novel framework that leverages world knowledge embedded within Multimodal Large Language Models. We introduce Collaborative Agents in Visual-Language Navigation (CA-VLN), a framework based on a dual-agent architecture. This architecture comprises a Knowledge Agent, which infuses the action prediction process with semantic context and commonsense reasoning, and a Hierarchical History Agent, which constructs a detailed episodic memory to enable long-horizon planning. The collaboration between these agents facilitates a dynamic interplay between high-level semantic understanding and grounded episodic experience. Extensive experiments on the R2R, REVERIE and SOON datasets demonstrate that our model achieves state-of-the-art performance, significantly…

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling