DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

Zhaowei Wang; Hongming Zhang; Tianqing Fang; Ye Tian; Yue Yang; Kaixin Ma; Xiaoman Pan; Yangqiu Song; Dong Yu

arXiv:2410.02730·cs.CV·September 3, 2025

TL;DR

This paper introduces DivScene, a large-scale dataset for open-vocabulary object navigation in diverse scenes, and shows that fine-tuning LVLMs on paths generated by breadth-first search (BFS) substantially improves their navigation success rates.

Contribution

The paper presents DivScene, a large-scale dataset for open-vocabulary object navigation, and shows that fine-tuning LVLMs on BFS-generated paths improves navigation performance.
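To make the idea of BFS-generated paths concrete, here is a minimal sketch of breadth-first search on a 2D occupancy grid. It is not the authors' implementation: the grid encoding (0 = free, 1 = blocked), the four-neighbor movement model, and the function name `bfs_shortest_path` are all assumptions chosen for illustration. On an unweighted grid, BFS returns a path with the fewest steps, which is why such paths can serve as supervision without human annotation.

```python
from collections import deque


def bfs_shortest_path(grid, start, goal):
    """Return a shortest list of (row, col) cells from start to goal, or None.

    Hypothetical helper: grid is a list of lists where 0 marks a free cell
    and 1 marks an obstacle. Movement is restricted to four neighbors.
    """
    rows, cols = len(grid), len(grid[0])
    parents = {start: None}  # also serves as the visited set
    queue = deque([start])

    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parent links back to the start, then reverse.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parents[cell]
            return path[::-1]

        r, c = cell
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            in_bounds = 0 <= nr < rows and 0 <= nc < cols
            if in_bounds and grid[nr][nc] == 0 and (nr, nc) not in parents:
                parents[(nr, nc)] = cell
                queue.append((nr, nc))

    return None  # goal is unreachable from start
```

In a fine-tuning setup along these lines, each consecutive pair of cells on the returned path could be converted into one next-action training target; how the paper discretizes scenes and defines actions is not stated on this page.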

Findings

01 Current LVLMs underperform in open-vocab navigation.

02 Fine-tuning LVLMs with BFS paths improves success rates by over 20%.

03 DivScene enables thorough evaluation of navigation models.

Abstract

Large Vision-Language Models (LVLMs) have achieved significant progress in tasks like visual question answering and document understanding. However, their potential to comprehend embodied environments and navigate within them remains underexplored. In this work, we first study the challenge of open-vocabulary object navigation by introducing DivScene, a large-scale dataset with 4,614 houses across 81 scene types and 5,707 kinds of target objects. Our dataset provides a much greater diversity of target objects and scene types than existing datasets, enabling a comprehensive task evaluation. We evaluated various methods with LVLMs and LLMs on our dataset and found that current models still fall short of open-vocab object navigation ability. Then, we fine-tuned LVLMs to predict the next action with CoT explanations. We observe that LVLMs' navigation ability can be improved substantially…

Code & Models

Repositories

zhaowei-wang-nlp/divscene
PyTorch · Official

Datasets

ZhaoweiWang/DivScene-DivTraj
dataset · 15 downloads

Videos

DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes

Taxonomy

Topics: Robotics and Sensor-Based Localization · Multimodal Machine Learning Applications · Robotic Path Planning Algorithms