Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation
Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, Hongyang Li

TL;DR
This paper introduces SparseVideoNav, a novel approach for beyond-the-view navigation that leverages sparse video generation for long-horizon supervision, enabling real-time, autonomous navigation in unknown environments with high success rates.
Contribution
The paper pioneers the use of video generation models for beyond-the-view navigation, achieving fast, long-horizon guidance and surpassing existing LLM-based methods in success rate and real-world applicability.
Findings
SparseVideoNav achieves 2.5x the success rate of LLM baselines.
It enables sub-second trajectory inference for 20-second horizons.
It demonstrates the first real-world zero-shot navigation in night scenes.
Abstract
Why must vision-language navigation be bound to detailed and verbose language instructions? While such details ease decision-making, they fundamentally contradict the goal of navigation in the real world. Ideally, agents should possess the autonomy to navigate in unknown environments guided solely by simple and high-level intents. Realizing this ambition introduces a formidable challenge: Beyond-the-View Navigation (BVN), where agents must locate distant, unseen targets without dense and step-by-step guidance. Existing large language model (LLM)-based methods, though adept at following dense instructions, often suffer from short-sighted behaviors due to their reliance on short-horizon supervision. Simply extending the supervision horizon, however, destabilizes LLM training. In this work, we identify that video generation models inherently benefit from long-horizon supervision to align…
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
