UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model

Changxin Huang; Lv Tang; Zhaohuan Zhan; Lisha Yu; Runhao Zeng; Zun Liu; Zhengjie Wang; Jianqiang Li

arXiv:2511.18845 · cs.AI · February 10, 2026

TL;DR

UNeMo introduces a collaborative multimodal framework that jointly reasons about visual states and navigation decisions, significantly improving performance in vision-and-language navigation tasks by integrating visual reasoning with policy optimization.

Contribution

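The paper proposes UNeMo, a novel multimodal world model with a hierarchical prediction-feedback mechanism that jointly optimizes visual reasoning and navigation policies. It targets two limitations of prior LLM-based methods: reasoning confined to the linguistic modality, and reasoning modules optimized separately from the navigation policy.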

Findings

01. Outperforms state-of-the-art methods in navigation accuracy on unseen scenes
02. Achieves improvements of 2.1% and 0.7% on the R2R and REVERIE datasets, respectively
03. Demonstrates effective cross-modal reasoning and policy collaboration

Abstract

Vision-and-Language Navigation (VLN), which requires agents to autonomously navigate complex environments using visual images and natural language instructions, remains highly challenging. Recent research on enhancing language-guided navigation reasoning with pre-trained large language models (LLMs) has shown promise. However, the reasoning of such methods is limited to the linguistic modality and lacks visual reasoning capabilities. Moreover, existing reasoning modules are optimized separately from navigation policies, leading to incompatibility and potential conflicts in optimization objectives. To tackle these challenges, we introduce UNeMo, a novel framework designed for the collaborative optimization of visual state reasoning and navigational decision-making. It introduces a Multimodal World Model (MWM) that takes visual features, language instructions, and navigational actions…

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Constraint Satisfaction and Optimization