LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs
Kun Luo, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, and Yaoming Zhou

TL;DR
LightZeroNav is a zero-shot method for vision-language navigation in continuous environments (VLN-CE) that uses only RGB observations and a lightweight Qwen3-VL-8B backbone. Without task-specific training, it achieves performance competitive with GPT-4o (~200B).
Contribution
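The paper presents LightZeroNav, an approach that enables zero-shot VLN-CE using only RGB inputs and a lightweight VLM (Qwen3-VL-8B). It targets three bottlenecks of lightweight VLMs in this setting: information redundancy from multi-source inputs, inaccurate progress estimation caused by noisy textual memory, and task entanglement between action execution and stage transition.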
Findings
Achieves performance competitive with GPT-4o (~200B) in zero-shot VLN-CE.
Addresses information redundancy from multi-source inputs and inaccurate progress estimation caused by noisy textual memory.
Operates without task-specific training, graph search, or waypoint predictors.
Abstract
Although vision-language navigation (VLN) has progressed rapidly, zero-shot VLN in continuous environments (VLN-CE) remains highly challenging when using lightweight vision-language models (VLMs), whose limited reasoning capacity makes long-horizon navigation unreliable. In this paper, we propose LightZeroNav to tackle the three major bottlenecks when using lightweight VLMs in zero-shot VLN-CE, i.e., information redundancy from multi-source inputs, inaccurate progress estimation caused by noisy textual memory, and task entanglement between action execution and stage transition. Using only RGB observations and a lightweight open-source Qwen3-VL-8B backbone, LightZeroNav achieves performance competitive with GPT-4o (~200B) without task-specific training, graph search, or waypoint predictors, demonstrating its effectiveness in zero-shot VLN-CE.
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Social Robot Interaction and HRI
