TopV-Nav: Unlocking the Top-View Spatial Reasoning Potential of MLLM for Zero-shot Object Navigation
Linqing Zhong, Chen Gao, Zihan Ding, Yue Liao, Huimin Ma, Shifeng Zhang, Xu Zhou, Si Liu

TL;DR
TopV-Nav leverages top-view maps and adaptive reasoning techniques to significantly improve zero-shot object navigation by maintaining spatial information and enabling more human-like exploration.
Contribution
The paper introduces TopV-Nav, a novel MLLM-based approach that directly reasons on top-view maps with adaptive prompts, dynamic scaling, and target prediction mechanisms for improved navigation.
Findings
Outperforms existing methods on the MP3D and HM3D datasets.
Enhances spatial reasoning with adaptive visual prompts and dynamic map scaling.
Facilitates more effective global and local exploration strategies.
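The summary mentions dynamic map scaling without giving details. The sketch below is only a rough illustration. It assumes that "scaling" means cropping a square region of interest around a chosen cell and enlarging it before it is rendered for the MLLM. `zoom_region` and its parameters are hypothetical and are not the paper's mechanism.

```python
# Hypothetical helper for illustration only; not the TopV-Nav scaling implementation.
def zoom_region(grid, center, half_size, factor):
    """Crop a (2*half_size+1)-cell square window around `center` and enlarge it
    by an integer `factor` using nearest-neighbour repetition.

    Cells that fall outside the original map are filled with None.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    cr, cc = center
    rows = len(grid)
    cols = len(grid[0]) if grid else 0

    # Crop the window, padding out-of-bounds cells with None.
    window = []
    for r in range(cr - half_size, cr + half_size + 1):
        row = []
        for c in range(cc - half_size, cc + half_size + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(grid[r][c] if inside else None)
        window.append(row)

    # Repeat every cell `factor` times horizontally and every row `factor` times vertically.
    zoomed = []
    for row in window:
        wide = [value for value in row for _ in range(factor)]
        for _ in range(factor):
            zoomed.append(list(wide))
    return zoomed


demo_grid = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
zoomed = zoom_region(demo_grid, center=(0, 0), half_size=1, factor=2)
print(len(zoomed), len(zoomed[0]))  # 6 6
```

Nearest-neighbour repetition is used here because map cells hold category labels. Interpolation would blend neighbouring labels into meaningless values.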
Abstract
The Zero-Shot Object Navigation (ZSON) task requires embodied agents to find a previously unseen object by navigating in unfamiliar environments. Such goal-oriented exploration relies heavily on the ability to perceive, understand, and reason about the spatial information of the environment. However, current LLM-based approaches convert visual observations into language descriptions and reason in the linguistic space, which discards spatial information. In this paper, we introduce TopV-Nav, an MLLM-based method that reasons directly on a top-view map containing sufficient spatial information. To fully unlock the MLLM's spatial reasoning potential from the top-view perspective, we propose the Adaptive Visual Prompt Generation (AVPG) method to adaptively construct a semantically rich top-view map. It enables the agent to directly utilize spatial information contained in the top-view map to…
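The abstract describes a semantically rich top-view map that the MLLM reasons on, but it gives no construction details here. The sketch below follows a common recipe, and it is an assumption that this matches the paper's AVPG. Labelled 3D points, already in a world frame with the y axis pointing up, are projected onto a 2D grid by dropping their height. Each cell keeps its majority label. One anchor cell per label is then chosen as the place where a text mark could be drawn as a visual prompt. `build_top_view_map`, `label_anchors`, and the cell-size convention are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical helpers for illustration only; not the TopV-Nav / AVPG implementation.

def build_top_view_map(points, cell_size, width, height):
    """Project labelled 3D points (x, y_up, z, label) onto a height x width grid.

    Each cell stores the most frequent label among the points falling in it,
    or None if no point landed there (unobserved cell).
    """
    votes = defaultdict(Counter)
    for x, _y_up, z, label in points:
        col = int(x // cell_size)  # height (y) is discarded: top-down projection
        row = int(z // cell_size)
        if 0 <= row < height and 0 <= col < width:
            votes[(row, col)][label] += 1
    grid = [[None] * width for _ in range(height)]
    for (row, col), counter in votes.items():
        grid[row][col] = counter.most_common(1)[0][0]
    return grid


def label_anchors(grid):
    """Return one (row, col) anchor per label: the labelled cell nearest the label's centroid.

    An anchor is where a textual mark (a visual prompt) could be drawn on the rendered map.
    """
    cells = defaultdict(list)
    for r, row in enumerate(grid):
        for c, label in enumerate(row):
            if label is not None:
                cells[label].append((r, c))
    anchors = {}
    for label, members in cells.items():
        cr = sum(r for r, _ in members) / len(members)
        cc = sum(c for _, c in members) / len(members)
        anchors[label] = min(members, key=lambda rc: (rc[0] - cr) ** 2 + (rc[1] - cc) ** 2)
    return anchors


points = [
    (0.1, 0.5, 0.1, "floor"),
    (0.3, 0.5, 0.1, "floor"),
    (1.1, 0.8, 0.2, "chair"),
    (1.2, 0.9, 0.3, "chair"),
    (1.15, 0.2, 1.1, "table"),
]
grid = build_top_view_map(points, cell_size=0.5, width=4, height=4)
anchors = label_anchors(grid)
print(anchors)  # {'floor': (0, 0), 'chair': (0, 2), 'table': (2, 2)}
```

The anchor is snapped to a cell that actually carries the label, not to the raw centroid. This keeps the mark inside the object even when the object's footprint is non-convex.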
Taxonomy
Topics: Robotic Path Planning Algorithms · Multimodal Machine Learning Applications · Robotics and Sensor-Based Localization
Methods: Semi-Pseudo-Label
