StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling

Meng Wei; Chenyang Wan; Xiqian Yu; Tai Wang; Yuqiang Yang; Xiaohan Mao; Chenming Zhu; Wenzhe Cai; Hanqing Wang; Yilun Chen; Xihui Liu; Jiangmiao Pang

arXiv:2507.05240·cs.RO·July 8, 2025

TL;DR

StreamVLN introduces a hybrid slow-fast context modeling framework for real-time vision-and-language navigation, enabling efficient, coherent multi-turn dialogue understanding over long visual streams with low latency.

Contribution

The paper proposes a slow-fast context modeling strategy that balances fine-grained visual understanding, long-term context modeling, and computational efficiency in streaming VLN tasks.

Findings

01 Achieves state-of-the-art performance on VLN-CE benchmarks.

02 Supports long video streams with bounded context and low inference cost.

03 Demonstrates robust and efficient real-world deployment capabilities.

Abstract

Vision-and-Language Navigation (VLN) in real-world settings requires agents to process continuous visual streams and generate actions with low latency, grounded in language instructions. While Video-based Large Language Models (Video-LLMs) have driven recent progress, current Video-LLM-based VLN methods often face trade-offs among fine-grained visual understanding, long-term context modeling, and computational efficiency. We introduce StreamVLN, a streaming VLN framework that employs a hybrid slow-fast context modeling strategy to support multi-modal reasoning over interleaved vision, language, and action inputs. The fast-streaming dialogue context facilitates responsive action generation through a sliding window of active dialogues, while the slow-updating memory context compresses historical visual states using a 3D-aware token pruning strategy. With this slow-fast design, StreamVLN…
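The slow-fast split described in the abstract can be illustrated with a toy data structure. The sketch below is not the authors' implementation: the class names, the `window_size` parameter, and especially the pruning rule are assumptions made for illustration. Here "3D-aware pruning" is approximated by keeping only the first token observed in each voxel, where `voxel_id` is a hypothetical 3D cell index assumed to come from back-projecting an image patch with depth. The abstract does not specify the actual pruning criterion.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One dialogue turn: visual tokens for an observation plus the action taken."""
    step: int
    tokens: list  # list of (voxel_id, feature) pairs; voxel_id is hypothetical
    action: str


@dataclass
class SlowFastContext:
    """Toy slow-fast context: a sliding window of turns plus a pruned memory."""
    window_size: int = 4
    window: deque = field(default_factory=deque)     # fast context: recent turns
    memory: list = field(default_factory=list)       # slow context: pruned tokens
    seen_voxels: set = field(default_factory=set)

    def add_turn(self, turn: Turn) -> None:
        # New turns always enter the fast window.
        self.window.append(turn)
        # Turns that slide out of the window are compressed into memory.
        while len(self.window) > self.window_size:
            self._compress(self.window.popleft())

    def _compress(self, turn: Turn) -> None:
        # Assumed pruning rule: drop tokens whose voxel is already in memory.
        for voxel_id, feature in turn.tokens:
            if voxel_id not in self.seen_voxels:
                self.seen_voxels.add(voxel_id)
                self.memory.append((voxel_id, feature))

    def context(self):
        # What a model would be conditioned on: pruned memory, then recent turns.
        return list(self.memory), list(self.window)


ctx = SlowFastContext(window_size=2)
ctx.add_turn(Turn(0, [(1, "a"), (2, "b")], "forward"))
ctx.add_turn(Turn(1, [(2, "c"), (3, "d")], "left"))
ctx.add_turn(Turn(2, [(3, "e"), (4, "f")], "forward"))
ctx.add_turn(Turn(3, [(4, "g"), (5, "h")], "stop"))
memory, window = ctx.context()
```

In this toy, the window length stays fixed regardless of episode length, and memory grows only when an evicted turn contains a voxel not seen before, which is one plausible way to keep the context from growing with every frame.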

Code & Models

Models

mengwei0427/StreamVLN_Video_qwen_1_5_r2r_rxr_envdrop_scalevln
model · 24 downloads · 2 likes

Datasets

cywan/StreamVLN-Trajectory-Data
dataset · 368 downloads
