DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes
Zhaowei Wang, Hongming Zhang, Tianqing Fang, Ye Tian, Yue Yang, Kaixin Ma, Xiaoman Pan, Yangqiu Song, Dong Yu

TL;DR
This paper introduces DivScene, a large-scale dataset for open-vocabulary object navigation in diverse scenes, and shows that fine-tuning LVLMs on BFS-generated paths significantly improves navigation success rates.
Contribution
The paper presents DivScene, a comprehensive dataset for open-vocabulary navigation, and shows how fine-tuning LVLMs with BFS paths enhances navigation performance.
Findings
Current LVLMs and LLMs fall short on open-vocabulary object navigation.
Fine-tuning LVLMs on BFS-generated paths improves success rates by over 20%.
DivScene's diversity of scene types and target objects enables a more comprehensive evaluation of navigation models.
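
To make the idea of BFS-generated paths concrete, the following is a minimal sketch of breadth-first search on a 4-connected grid. It is an illustration only, not the authors' pipeline: the grid encoding ('#' for obstacles), the function name bfs_shortest_path, and the example layout are all assumptions introduced here. On an unweighted grid, BFS returns a path with the fewest steps, which is the kind of path that could serve as supervision for next-action prediction.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    grid is a list of equal-length strings; '#' marks an obstacle.
    This encoding is hypothetical and chosen only for illustration.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            nxt = (r + dr, c + dc)
            nr, nc = nxt
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != "#" and nxt not in parent:
                parent[nxt] = cell
                queue.append(nxt)
    return None  # goal is unreachable


# Hypothetical layout: a wall in column 2 forces a detour through row 2.
example_grid = [
    "..#.",
    "..#.",
    "....",
]
example_path = bfs_shortest_path(example_grid, (0, 0), (0, 3))
```

Each consecutive pair of cells in the returned path corresponds to one move, so a path could be converted into a sequence of (observation, next action) training pairs under whatever action space the simulator defines.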
Abstract
Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLM's navigation ability can be improved substantially…
Taxonomy
Topics: Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms
