Zero-shot Object Navigation with Vision-Language Models Reasoning

Congcong Wen; Yisiyuan Huang; Hao Huang; Yanjia Huang; Shuaihang Yuan; Yu Hao; Hui Lin; Yu-Shen Liu; Yi Fang

arXiv:2410.18570·cs.RO·October 25, 2024

Open Access

TL;DR

This paper introduces VLTNet, a vision-language model with a Tree-of-Thought network for language-driven zero-shot object navigation (L-ZSON), in which a robot follows natural language instructions to find objects in unknown environments without task-specific training data.

Contribution

The paper presents VLTNet, a model that integrates Tree-of-Thought reasoning to improve zero-shot object navigation driven by natural language instructions.

Findings

01. VLTNet outperforms existing methods on the PASTURE and RoboTHOR benchmarks.

02. Tree-of-Thought reasoning improves navigation accuracy in complex scenarios.

03. The model effectively handles natural language instructions for object navigation.

Abstract

Object navigation is crucial for robots, but traditional methods require substantial training data and do not generalize to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) extends ZSON by incorporating natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, the Tree-of-Thought (ToT) reasoning and exploration module serves as a core component, innovatively using the ToT reasoning framework…
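The abstract is cut off before it explains how the ToT module is applied, so the following is only a generic illustration of the Tree-of-Thought search pattern (propose several candidate next steps, score them, keep the best few, repeat), not VLTNet's actual procedure. Every name here (`tree_of_thought_search`, `propose`, `score`, `PRIOR`, `CHILDREN`) is hypothetical, and the numbers are invented toy values standing in for what would normally be language-model calls.

```python
from typing import Callable, Dict, List, Tuple

Thought = Tuple[str, ...]  # a partial chain of reasoning steps


def tree_of_thought_search(
    root: Thought,
    propose: Callable[[Thought], List[str]],
    score: Callable[[Thought], float],
    depth: int,
    beam_width: int,
) -> Thought:
    """Breadth-first ToT-style search: expand, score, keep the best few."""
    frontier: List[Thought] = [root]
    for _ in range(depth):
        # Extend every kept chain by each step proposed for it.
        expanded = [t + (step,) for t in frontier for step in propose(t)]
        if not expanded:
            break  # nothing left to expand; keep the current frontier
        expanded.sort(key=score, reverse=True)
        frontier = expanded[:beam_width]
    return max(frontier, key=score)


# Toy stand-ins for model calls (assumed values, for illustration only).
PRIOR: Dict[str, float] = {
    "bedroom": 0.7, "kitchen": 0.2,
    "nightstand": 0.9, "floor": 0.3, "counter": 0.5,
}
CHILDREN: Dict[Thought, List[str]] = {
    (): ["bedroom", "kitchen"],
    ("bedroom",): ["nightstand", "floor"],
    ("kitchen",): ["counter"],
}


def propose(thought: Thought) -> List[str]:
    """Return candidate next steps for a partial chain."""
    return CHILDREN.get(thought, [])


def score(thought: Thought) -> float:
    """Score a chain as the product of its steps' toy priors."""
    value = 1.0
    for step in thought:
        value *= PRIOR[step]
    return value


best = tree_of_thought_search((), propose, score, depth=2, beam_width=2)
print(best)
```

In this toy setup the search keeps both rooms after the first level and then prefers the chain whose combined prior is highest. In a real system `propose` and `score` would be replaced by model queries, which is an assumption about usage rather than something the abstract states.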

Taxonomy

Topics: Robotic Path Planning Algorithms · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques