UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model
Changxin Huang, Lv Tang, Zhaohuan Zhan, Lisha Yu, Runhao Zeng, Zun Liu, Zhengjie Wang, Jianqiang Li

TL;DR
UNeMo introduces a collaborative multimodal framework that jointly reasons about visual states and navigation decisions, significantly improving performance in vision-and-language navigation tasks by integrating visual reasoning with policy optimization.
Contribution
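The paper proposes UNeMo, a novel framework built on a multimodal world model with a hierarchical prediction-feedback mechanism that jointly optimizes visual reasoning and navigation policies. It targets two limitations of prior methods: reasoning confined to the linguistic modality, and reasoning modules optimized separately from the navigation policy.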
Findings
Outperforms state-of-the-art methods in navigation accuracy on unseen scenes
Achieves improvements of 2.1% and 0.7% on the R2R and REVERIE datasets, respectively
Demonstrates effective cross-modal reasoning and policy collaboration
Abstract
Vision-and-Language Navigation (VLN), which requires agents to autonomously navigate complex environments using visual images and natural language instructions, remains highly challenging. Recent research on enhancing language-guided navigation reasoning with pre-trained large language models (LLMs) has shown promise. However, the reasoning of such methods is limited to the linguistic modality and lacks visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Constraint Satisfaction and Optimization
