FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation

Jing Zuo; Lingzhou Mu; Fan Jiang; Chengcheng Ma; Mu Xu; Yonggang Qi

arXiv:2601.13976·cs.CV·January 26, 2026

TL;DR

FantasyVLN introduces a unified multimodal reasoning framework for vision-language navigation that maintains interpretability and reasoning capabilities without token inflation, enabling real-time, human-like navigation performance.

Contribution

The paper proposes FantasyVLN, a novel implicit reasoning approach that encodes imagined visual tokens into a compact space, reducing token inflation and improving real-time navigation efficiency.

Findings

01. Achieves higher success rates on the LH-VLN benchmark.

02. Reduces inference latency by an order of magnitude.

03. Maintains reasoning capabilities without explicit token overhead.

Abstract

Achieving human-level performance in Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context while reasoning over long action sequences. Recent works, such as NavCoT and NavGPT-2, demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. Moreover, multimodal extensions like OctoNav-R1 and CoT-VLA further validate CoT as a promising pathway toward human-like navigation reasoning. However, existing approaches face critical drawbacks: purely textual CoTs lack spatial grounding and easily overfit to sparse annotated reasoning steps, while multimodal CoTs incur severe token inflation by generating imagined visual observations, making real-time navigation impractical. In this work, we propose FantasyVLN, a unified implicit reasoning framework that…

Code & Models

Models

🤗 acvlab/FantasyVLN (model · 11 downloads)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Explainable Artificial Intelligence (XAI)