Leveraging Large Language Model-based Room-Object Relationships Knowledge for Enhancing Multimodal-Input Object Goal Navigation

Leyuan Sun; Asako Kanezaki; Guillaume Caron; Yusuke Yoshiyasu

arXiv:2403.14163·cs.RO·March 20, 2025·1 cites

TL;DR

This paper introduces a modular approach that leverages large language model-derived object-to-room knowledge to improve object-goal navigation efficiency in simulated and real environments.

Contribution

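The paper presents a data-driven, modular method for object-goal navigation that is trained on a dataset incorporating object-to-room relationship knowledge extracted from a large language model and that operates on multimodal inputs.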

Findings

1. Outperforms baseline by 10.6% in SPL metric
2. Effective in both simulated and real-world environments
3. Utilizes multi-channel Swin-Unet for multi-task learning
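
For readers unfamiliar with the SPL metric named in the findings: SPL (Success weighted by Path Length) is commonly defined as the average over episodes of S_i * l_i / max(p_i, l_i), where S_i is 1 on success and 0 otherwise, l_i is the shortest-path length to the goal, and p_i is the length of the path the agent actually took. The sketch below is a minimal illustration of that standard definition, not the paper's evaluation code; the function name `spl` and its argument names are hypothetical, and this page does not say whether the 10.6% gain is absolute or relative.

```python
from typing import Sequence


def spl(
    successes: Sequence[bool],
    shortest_lengths: Sequence[float],
    path_lengths: Sequence[float],
) -> float:
    """Success weighted by Path Length, averaged over episodes.

    Illustrative helper (hypothetical name and signature), following the
    standard definition: mean of S_i * l_i / max(p_i, l_i).
    """
    n = len(successes)
    if n == 0:
        raise ValueError("at least one episode is required")
    if len(shortest_lengths) != n or len(path_lengths) != n:
        raise ValueError("all inputs must have the same length")

    total = 0.0
    for success, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        # Failed episodes contribute 0; guard against a zero-length episode.
        if success and denom > 0:
            total += shortest / denom
    return total / n


# Three episodes: an optimal success, a success with a 2x detour, and a failure.
score = spl([True, True, False], [5.0, 4.0, 3.0], [5.0, 8.0, 6.0])
print(score)
```

Because failed episodes score zero and detours shrink the ratio, SPL rewards agents that both reach the goal and do so along a near-shortest path, which is why it is used as an efficiency measure for object-goal navigation.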

Abstract

Object-goal navigation is a crucial engineering task for the community of embodied navigation; it involves navigating to an instance of a specified object category within unseen environments. Although extensive investigations have been conducted on both end-to-end and modular-based, data-driven approaches, fully enabling an agent to comprehend the environment through perceptual knowledge and perform object-goal navigation as efficiently as humans remains a significant challenge. Recently, large language models have shown potential in this task, thanks to their powerful capabilities for knowledge extraction and integration. In this study, we propose a data-driven, modular-based approach, trained on a dataset that incorporates common-sense knowledge of object-to-room relationships extracted from a large language model. We utilize the multi-channel Swin-Unet architecture to conduct…

Taxonomy

Topics

Speech and dialogue systems