A Navigation Framework Utilizing Vision-Language Models

Yicheng Duan; Kaiyu Tang

arXiv:2506.10172·cs.RO·June 13, 2025

TL;DR

This paper introduces a modular navigation framework that leverages large vision-language models to improve embodied AI navigation tasks, emphasizing efficiency and adaptability in complex environments.

Contribution

It proposes a decoupled, plug-and-play architecture integrating LVLMs with lightweight planning, enabling flexible navigation without extensive fine-tuning.

Findings

01. Achieved promising results on the Room-to-Room benchmark.

02. Identified challenges in generalizing to unseen environments.

03. Highlighted the potential of modular approaches for scalable navigation.

Abstract

Vision-and-Language Navigation (VLN) presents a complex challenge in embodied AI, requiring agents to interpret natural language instructions and navigate through visually rich, unfamiliar environments. Recent advances in large vision-language models (LVLMs), such as CLIP and Flamingo, have significantly improved multimodal understanding but introduced new challenges related to computational cost and real-time deployment. In this project, we propose a modular, plug-and-play navigation framework that decouples vision-language understanding from action planning. By integrating a frozen vision-language model, Qwen2.5-VL-7B-Instruct, with lightweight planning logic, we aim to achieve flexible, fast, and adaptable navigation without extensive model fine-tuning. Our framework leverages prompt engineering, structured history management, and a two-frame visual input strategy to enhance…
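
The decoupling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the action set, the class and function names (`NavigationState`, `build_prompt`, `parse_action`, `step`), the prompt wording, and the history length are all assumptions made for illustration. The frozen vision-language model is represented by a plain callable so the planning logic can be read and run without loading Qwen2.5-VL-7B-Instruct.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List, Optional

# Hypothetical discrete action set; the paper's actual action space may differ.
ACTIONS = ("forward", "turn_left", "turn_right", "stop")


@dataclass
class NavigationState:
    """Holds the instruction, a bounded action history, and the last two frames."""
    instruction: str
    max_history: int = 5  # assumed cap on remembered steps
    history: Deque[str] = field(default_factory=deque)
    previous_frame: Optional[str] = None
    current_frame: Optional[str] = None

    def observe(self, frame: str) -> None:
        # Shift the current frame to "previous" and store the new observation.
        self.previous_frame, self.current_frame = self.current_frame, frame

    def record(self, action: str) -> None:
        # Append the action and drop the oldest entries beyond the cap.
        self.history.append(action)
        while len(self.history) > self.max_history:
            self.history.popleft()

    def frames(self) -> List[str]:
        # Two-frame input: previous then current, skipping a missing first frame.
        return [f for f in (self.previous_frame, self.current_frame) if f is not None]


def build_prompt(state: NavigationState) -> str:
    """Assemble a text prompt from the instruction and the structured history."""
    lines = [f"Instruction: {state.instruction}", "History:"]
    if state.history:
        lines += [f"  {i}. {a}" for i, a in enumerate(state.history, 1)]
    else:
        lines.append("  (none)")
    lines.append(f"Images attached: {len(state.frames())} (previous, then current).")
    lines.append("Reply with exactly one of: " + ", ".join(ACTIONS))
    return "\n".join(lines)


def parse_action(reply: str) -> str:
    """Return the first action (in ACTIONS order) named in the reply, else 'stop'."""
    text = reply.lower()
    for action in ACTIONS:
        if action in text:
            return action
    return "stop"


def step(state: NavigationState, frame: str,
         vlm: Callable[[str, List[str]], str]) -> str:
    """One decision step: observe, query the frozen model, parse, record."""
    state.observe(frame)
    action = parse_action(vlm(build_prompt(state), state.frames()))
    state.record(action)
    return action


# Stand-in for the frozen model so the sketch runs without any weights.
def fake_vlm(prompt: str, frames: List[str]) -> str:
    return "turn_left" if len(frames) < 2 else "forward"


state = NavigationState("Walk past the sofa and stop at the door.", max_history=2)
actions = [step(state, f"frame_{i}.png", fake_vlm) for i in range(3)]
```

Because the model is only ever reached through the `vlm` callable, a different vision-language model could in principle be swapped in without touching the history or parsing logic, which is the plug-and-play property the abstract emphasizes.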

Code & Models

Repositories

yichengduan/oobvlm (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Social Robot Interaction and HRI · Advanced Neural Network Applications

Methods: Contrastive Language-Image Pre-training