Zero-shot Object Navigation with Vision-Language Models Reasoning
Congcong Wen, Yisiyuan Huang, Hao Huang, Yanjia Huang, Shuaihang Yuan, Yu Hao, Hui Lin, Yu-Shen Liu, Yi Fang

TL;DR
This paper introduces VLTNet, a novel vision-language model with Tree-of-Thought reasoning for zero-shot object navigation, enabling robots to follow natural language instructions in unknown environments without prior training.
Contribution
The paper presents a new VLTNet model that integrates Tree-of-Thought reasoning for improved zero-shot object navigation using natural language instructions.
Findings
VLTNet outperforms existing methods on the PASTURE and RoboTHOR benchmarks.
Tree-of-Thought reasoning improves navigation accuracy in complex scenarios.
The model effectively handles natural language instructions for object navigation.
Abstract
Object navigation is crucial for robots, but traditional methods require substantial training data and do not generalize to unknown environments. Zero-shot object navigation (ZSON) aims to address this challenge, allowing robots to interact with unknown objects without specific training data. Language-driven zero-shot object navigation (L-ZSON) is an extension of ZSON that incorporates natural language instructions to guide robot navigation and interaction with objects. In this paper, we propose a novel Vision Language model with a Tree-of-thought Network (VLTNet) for L-ZSON. VLTNet comprises four main modules: vision language model understanding, semantic mapping, tree-of-thought reasoning and exploration, and goal identification. Among these modules, the Tree-of-Thought (ToT) reasoning and exploration module serves as a core component, innovatively using the ToT reasoning framework…
Taxonomy
Topics: Robotic Path Planning Algorithms · Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques
