STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation
Haokun Zhu, Zongtai Li, Zhixuan Liu, Wenshan Wang, Ji Zhang, Jonathan Francis, Jean Oh

TL;DR
This paper introduces STRIVE, a structured environment representation and two-stage navigation policy that leverages vision-language models for more efficient and accurate object navigation in simulated and real indoor environments.
Contribution
The paper presents a novel multi-layer environment representation and a two-stage navigation policy integrating VLM reasoning, improving navigation success and efficiency.
Findings
Achieved 7.1% higher success rate on benchmarks.
Improved navigation efficiency by 12.5%.
Demonstrated robustness on real robot tasks.
Abstract
Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation poses two key challenges: effectively representing complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g., querying at every step) can lead to unnecessary backtracking and reduced navigation efficiency, especially in continuous environments. To address these challenges, we propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoints, object nodes, and room nodes. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room…
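To make the three-layer idea concrete, the following is a minimal sketch of how viewpoints, object nodes, and room nodes might be held in one structure. It is not the authors' implementation: every class, field, and method name here (`Viewpoint`, `ObjectNode`, `RoomNode`, `EnvironmentGraph`, `unexplored`, `find_object`) is a hypothetical chosen for illustration, and the 2D positions and the `visited` flag are assumptions rather than details taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Viewpoint:
    """A location inside a room from which the agent can observe (assumed 2D)."""
    position: tuple[float, float]
    visited: bool = False


@dataclass
class ObjectNode:
    """A detected object with a semantic label and an estimated position."""
    label: str
    position: tuple[float, float]


@dataclass
class RoomNode:
    """A room that groups its viewpoints and objects and links to adjacent rooms."""
    name: str
    viewpoints: list[Viewpoint] = field(default_factory=list)
    objects: list[ObjectNode] = field(default_factory=list)
    neighbors: set[str] = field(default_factory=set)

    def unexplored(self) -> list[Viewpoint]:
        # Viewpoints that have not been visited yet (intra-room exploration).
        return [v for v in self.viewpoints if not v.visited]


class EnvironmentGraph:
    """Hypothetical container for the multi-layer representation."""

    def __init__(self) -> None:
        self.rooms: dict[str, RoomNode] = {}

    def add_room(self, name: str) -> RoomNode:
        # Return the existing room or create it on first use.
        return self.rooms.setdefault(name, RoomNode(name))

    def connect(self, a: str, b: str) -> None:
        # Room-to-room edges are what inter-room planning would traverse.
        self.add_room(a).neighbors.add(b)
        self.add_room(b).neighbors.add(a)

    def find_object(self, label: str) -> tuple[str, ObjectNode] | None:
        # Look the target up across all rooms; None if it has not been seen.
        for room in self.rooms.values():
            for obj in room.objects:
                if obj.label == label:
                    return room.name, obj
        return None


# Usage: a two-room map with one partially explored room.
graph = EnvironmentGraph()
kitchen = graph.add_room("kitchen")
kitchen.viewpoints += [Viewpoint((0.0, 0.0), visited=True), Viewpoint((2.0, 1.0))]
kitchen.objects.append(ObjectNode("sink", (2.5, 1.0)))
graph.connect("kitchen", "hallway")
```

In a layout like this, a policy could consult `unexplored()` for within-room decisions and the `neighbors` sets for between-room decisions, reserving VLM queries for the points where a choice is actually needed; how STRIVE itself schedules those queries is not specified in the text above.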
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI
