STRIVE: Structured Representation Integrating VLM Reasoning for Efficient Object Navigation

Haokun Zhu; Zongtai Li; Zhixuan Liu; Wenshan Wang; Ji Zhang; Jonathan Francis; Jean Oh

arXiv:2505.06729·cs.RO·September 17, 2025

TL;DR

This paper introduces STRIVE, a structured environment representation and two-stage navigation policy that leverages vision-language models for more efficient and accurate object navigation in simulated and real indoor environments.

Contribution

The paper presents a novel multi-layer environment representation and a two-stage navigation policy integrating VLM reasoning, improving navigation success and efficiency.

Findings

01

Achieved 7.1% higher success rate on benchmarks.

02

Improved navigation efficiency by 12.5%.

03

Demonstrated robustness on real robot tasks.

Abstract

Vision-Language Models (VLMs) have been increasingly integrated into object navigation tasks for their rich prior knowledge and strong reasoning abilities. However, applying VLMs to navigation poses two key challenges: effectively representing complex environment information and determining when and how to query VLMs. Insufficient environment understanding and over-reliance on VLMs (e.g., querying at every step) can lead to unnecessary backtracking and reduced navigation efficiency, especially in continuous environments. To address these challenges, we propose a novel framework that constructs a multi-layer representation of the environment during navigation. This representation consists of viewpoints, object nodes, and room nodes. Viewpoints and object nodes facilitate intra-room exploration and accurate target localization, while room nodes support efficient inter-room…
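
The three-layer representation described in the abstract can be pictured as a small nested graph. The sketch below is only an illustration under stated assumptions, not the authors' implementation: the class names (`ViewpointNode`, `ObjectNode`, `RoomNode`, `EnvironmentGraph`), the fields, the 2D coordinates, and the `visited` flag are all hypothetical, since the abstract does not specify what each node stores.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumption: positions are 2D map coordinates; the paper may use another form.
Position = Tuple[float, float]


@dataclass
class ViewpointNode:
    """Hypothetical viewpoint: a place the robot can observe from."""
    node_id: int
    position: Position
    visited: bool = False


@dataclass
class ObjectNode:
    """Hypothetical object node: a detected object with a semantic label."""
    node_id: int
    label: str
    position: Position


@dataclass
class RoomNode:
    """Hypothetical room node grouping the viewpoints and objects inside it."""
    node_id: int
    label: str
    viewpoints: List[ViewpointNode] = field(default_factory=list)
    objects: List[ObjectNode] = field(default_factory=list)

    def unvisited_viewpoints(self) -> List[ViewpointNode]:
        # Intra-room exploration: viewpoints not yet observed from.
        return [v for v in self.viewpoints if not v.visited]


@dataclass
class EnvironmentGraph:
    """Top layer: rooms keyed by id, each holding its own lower-layer nodes."""
    rooms: Dict[int, RoomNode] = field(default_factory=dict)

    def add_room(self, room: RoomNode) -> None:
        self.rooms[room.node_id] = room

    def find_object(self, label: str) -> Optional[Tuple[RoomNode, ObjectNode]]:
        # Target localization: return the first object with a matching label.
        for room in self.rooms.values():
            for obj in room.objects:
                if obj.label == label:
                    return room, obj
        return None

    def rooms_to_explore(self) -> List[RoomNode]:
        # Inter-room exploration: rooms that still have unvisited viewpoints.
        return [r for r in self.rooms.values() if r.unvisited_viewpoints()]
```

In a layout like this, a planner could consult `rooms_to_explore()` to choose the next room and `unvisited_viewpoints()` to choose where to look inside it. How STRIVE actually ranks rooms, and when it queries the VLM, is not covered by this sketch.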

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Social Robot Interaction and HRI