A Navigation Framework Utilizing Vision-Language Models
Yicheng Duan, Kaiyu Tang

TL;DR
This paper introduces a modular navigation framework that leverages large vision-language models to improve embodied AI navigation tasks, emphasizing efficiency and adaptability in complex environments.
Contribution
It proposes a decoupled, plug-and-play architecture integrating LVLMs with lightweight planning, enabling flexible navigation without extensive fine-tuning.
Findings
Achieved promising results on the Room-to-Room benchmark.
Identified challenges in generalizing to unseen environments.
Highlighted the potential of modular approaches for scalable navigation.
Abstract
Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance…
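The decoupling described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the `Navigator` class, the four-action vocabulary, the prompt wording, the history length, and the fall-back to `stop` are all hypothetical. The vision-language model is treated as an opaque callable that takes a prompt and a list of frames, so a frozen model such as Qwen2.5-VL-7B-Instruct could be wrapped behind it without touching the planning logic.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional

# Hypothetical discrete action set; the paper's actual action space may differ.
ACTIONS = ("forward", "turn_left", "turn_right", "stop")


@dataclass
class Navigator:
    # Any frozen model wrapped as (prompt, frames) -> reply text.
    vlm: Callable[[str, List[bytes]], str]
    max_history: int = 5  # assumed cap on remembered actions
    history: Deque[str] = field(default_factory=deque)
    prev_frame: Optional[bytes] = None

    def build_prompt(self, instruction: str) -> str:
        """Prompt engineering: instruction plus a compact action history."""
        past = ", ".join(self.history) or "none"
        return (
            f"Instruction: {instruction}\n"
            f"Previous actions: {past}\n"
            f"Choose one action from: {', '.join(ACTIONS)}."
        )

    def parse_action(self, reply: str) -> str:
        """Lightweight planning: pick the first known action in the reply."""
        text = reply.lower()
        hits = [(text.find(a), a) for a in ACTIONS if a in text]
        # Assumed safe default when the reply names no known action.
        return min(hits)[1] if hits else "stop"

    def step(self, instruction: str, frame: bytes) -> str:
        """One navigation step using the previous and current frames."""
        frames = [f for f in (self.prev_frame, frame) if f is not None]
        reply = self.vlm(self.build_prompt(instruction), frames)
        action = self.parse_action(reply)
        self.history.append(action)
        while len(self.history) > self.max_history:
            self.history.popleft()  # keep the history bounded
        self.prev_frame = frame
        return action


def fake_vlm(prompt: str, frames: List[bytes]) -> str:
    """Stand-in for a real model: turns on the first frame, then advances."""
    return "turn_left" if len(frames) == 1 else "I would go forward."


nav = Navigator(vlm=fake_vlm, max_history=2)
first = nav.step("Walk to the kitchen.", b"frame0")   # only one frame available
second = nav.step("Walk to the kitchen.", b"frame1")  # two-frame input
```

Because the model sits behind a plain callable, swapping in a different vision-language model or a different parsing rule changes one component without retraining the other, which is the plug-and-play property the abstract emphasizes.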
Taxonomy
Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Advanced Neural Network Applications
Methods: Contrastive Language-Image Pre-training
