LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation

Tianrui Guan; Yurou Yang; Harry Cheng; Muyuan Lin; Richard Kim; Rajasimman Madhivanan; Arnie Sen; Dinesh Manocha

arXiv:2405.05363 · cs.CV · May 10, 2024 · 2 cites

Open Access

TL;DR

LOC-ZSON introduces a language-driven, object-centric image representation and training approach that enhances zero-shot object retrieval and navigation in complex scenes, demonstrating significant improvements in success rates across simulated and real-world environments.

Contribution

The paper presents a novel object-centric image representation and a training framework using language models for improved zero-shot object navigation and retrieval.

Findings

01 Achieves a 1.38-13.38% improvement in text-to-image recall across benchmark settings for the retrieval task.

02 Improves navigation success rates by 5% in simulation and 16.67% in the real world.

03 Introduces LLM-based augmentation and prompt templates for stability during training and zero-shot inference.

Abstract

In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method achieves an improvement of 1.38-13.38% in text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, showing 5% and 16.67% improvement in terms of navigation success…
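The retrieval results above are reported as text-to-image recall. As a rough illustration of how such a metric is commonly computed, the following is a minimal sketch of recall@k over cosine similarity. It is not the authors' evaluation code: the function names, the toy embeddings, and the assumption of one ground-truth image per query are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def text_to_image_recall_at_k(text_embs, image_embs, gt_indices, k):
    """Fraction of text queries whose ground-truth image ranks in the top k.

    Assumes (hypothetically) exactly one correct image per text query.
    """
    hits = 0
    for query, gt in zip(text_embs, gt_indices):
        sims = [cosine(query, img) for img in image_embs]
        # Rank image indices from most to least similar to the query.
        ranked = sorted(range(len(image_embs)), key=lambda i: sims[i], reverse=True)
        if gt in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Toy 2-D embeddings (made up for illustration only).
text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
image_embs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]
gt_indices = [0, 1, 2]  # ground-truth image index for each text query

recall_at_1 = text_to_image_recall_at_k(text_embs, image_embs, gt_indices, k=1)
recall_at_3 = text_to_image_recall_at_k(text_embs, image_embs, gt_indices, k=3)
```

In this toy setup the third query's ground-truth image is the least similar of the three, so it counts as a miss at k=1 and only becomes a hit once k covers the whole gallery.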

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques