MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming
Shuo Wang, Yongcai Wang, Zhaoxin Fan, Yucheng Wang, Maiyue Chen, Kaihui Wang, Zhizhong Su, Wanting Li, Xudong Cai, Yeying Jin, Deying Li

TL;DR
MonoDream is a monocular vision-language navigation framework that learns a Unified Navigation Representation supervised by latent panoramic dreaming tasks, improving monocular navigation performance and narrowing the gap with panoramic methods.
Contribution
The paper proposes MonoDream, a lightweight Vision-Language Action (VLA) framework for monocular VLN that combines a Unified Navigation Representation (UNR) with Latent Panoramic Dreaming (LPD) tasks.
Findings
MonoDream improves monocular navigation accuracy.
It narrows the performance gap with panoramic-based agents.
The approach is effective across multiple VLN benchmarks.
Abstract
Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth…
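The abstract describes LPD as auxiliary supervision in which the model predicts latent features of panoramic RGB and depth. The excerpt is truncated and does not state the loss, so the following is only a minimal sketch of one plausible reading: a mean-squared-error alignment between predicted and target latents, added to an action loss with a weight. The function names, the MSE choice, the `dream_weight` parameter, and the dictionary keys are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, Sequence


def latent_alignment_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between a predicted and a target latent vector.

    Assumption: the target would come from an encoder applied to panoramic
    RGB or depth at training time; the abstract does not specify this.
    """
    if len(predicted) != len(target) or len(predicted) == 0:
        raise ValueError("latent vectors must be non-empty and equally long")
    squared_errors = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared_errors) / len(squared_errors)


def total_loss(action_loss: float,
               dream_losses: Dict[str, float],
               dream_weight: float = 1.0) -> float:
    """Action loss plus a weighted sum of auxiliary dreaming losses.

    `dream_weight` is a hypothetical hyperparameter, not one named in the paper.
    """
    return action_loss + dream_weight * sum(dream_losses.values())


# Toy usage with made-up numbers: two hypothetical auxiliary targets.
dream = {
    "pano_rgb": latent_alignment_loss([0.2, 0.4], [0.0, 0.0]),
    "pano_depth": latent_alignment_loss([1.0, 1.0], [1.0, 1.0]),
}
loss = total_loss(action_loss=0.5, dream_losses=dream, dream_weight=0.1)
```

Under this reading, the panoramic targets are needed only during training, so the agent would still run on monocular input at inference time.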
Taxonomy
Topics: Multimodal Machine Learning Applications · Robotic Path Planning Algorithms · Robotics and Sensor-Based Localization
