PROSPECT: Unified Streaming Vision-Language Navigation via Semantic–Spatial Fusion and Latent Predictive Representation

Zehua Fan, Wenqi Lyu, Wenxuan Song, Linge Zhao, Yifei Yang, Xi Wang, Junjie He, Lida Huang, Haiyan Liu, Bingchuan Sun, Guangjun Bao, Xuanyao Mao, Liang Xu, Yan Wang, and Feng Gao

arXiv:2603.03739·cs.CV·March 5, 2026

TL;DR

PROSPECT is a unified streaming navigation agent that fuses semantic and spatial features and adds latent predictive modeling. It reports state-of-the-art results on VLN-CE benchmarks and robust behavior in real-robot deployment.

Contribution

The paper presents a unified streaming agent that fuses semantic and spatial features and uses latent predictive learning to improve navigation.

Findings

01. Achieves state-of-the-art performance on VLN-CE benchmarks.

02. Demonstrates robustness in real-robot deployment under diverse lighting.

03. Improves long-horizon navigation stability.

Abstract

Multimodal large language models (MLLMs) have advanced zero-shot end-to-end Vision-Language Navigation (VLN), yet robust navigation requires not only semantic understanding but also predictive modeling of environment dynamics and spatial structure. We propose PROSPECT, a unified streaming navigation agent that couples a streaming Vision-Language-Action (VLA) policy with latent predictive representation learning. PROSPECT uses CUT3R as a streaming 3D foundation spatial encoder to produce long-context, absolute-scale spatial features, and fuses them with SigLIP semantic features via cross-attention. During training, we introduce learnable stream query tokens that query the streaming context and predict next-step 2D and 3D latent features (rather than pixels or explicit modalities), supervised in the latent spaces of frozen SigLIP and CUT3R teachers. The predictive branch shapes internal…
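
To make the fusion step in the abstract more concrete, here is a minimal, dependency-free sketch of single-head cross-attention in which semantic tokens attend to spatial tokens. It is an illustration under stated assumptions, not the authors' implementation: the abstract only says the two feature streams are fused "via cross-attention", so the choice of semantic tokens as queries, the residual connection, the single head, and every name here (`cross_attention_fuse`, `w_q`, `w_k`, `w_v`, the toy dimensions) are hypothetical. The random matrices merely stand in for SigLIP and CUT3R features.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(semantic, spatial, w_q, w_k, w_v):
    """Fuse semantic tokens with spatial tokens (hypothetical layout).

    semantic: (n_sem x d_sem) tokens used as queries (an assumption).
    spatial:  (n_spa x d_spa) tokens used as keys and values (an assumption).
    w_q: (d_sem x d_attn), w_k: (d_spa x d_attn), w_v: (d_spa x d_sem).
    Returns (fused, weights): fused is (n_sem x d_sem), weights is (n_sem x n_spa).
    """
    q = matmul(semantic, w_q)
    k = matmul(spatial, w_k)
    v = matmul(spatial, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = [[scale * s for s in row] for row in matmul(q, transpose(k))]
    weights = [softmax(row) for row in scores]
    attended = matmul(weights, v)
    # Residual connection: an assumption, not stated in the abstract.
    fused = [[s + a for s, a in zip(s_row, a_row)]
             for s_row, a_row in zip(semantic, attended)]
    return fused, weights


def rand_matrix(rows, cols, rng):
    """Small random matrix standing in for real features or learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    semantic = rand_matrix(4, 8, rng)  # stand-in for 4 semantic tokens, dim 8
    spatial = rand_matrix(6, 5, rng)   # stand-in for 6 spatial tokens, dim 5
    fused, weights = cross_attention_fuse(
        semantic, spatial,
        rand_matrix(8, 4, rng), rand_matrix(5, 4, rng), rand_matrix(5, 8, rng),
    )
    print(len(fused), len(fused[0]), len(weights), len(weights[0]))
```

Each fused token keeps the semantic token's dimensionality, so a downstream policy could consume it in place of the original semantic feature; whether PROSPECT does this is not stated in the abstract.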

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Robotics and Sensor-Based Localization