LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation
Tianrui Guan, Yurou Yang, Harry Cheng, Muyuan Lin, Richard Kim, Rajasimman Madhivanan, Arnie Sen, Dinesh Manocha

TL;DR
LOC-ZSON introduces a language-driven, object-centric image representation and training approach that enhances zero-shot object retrieval and navigation in complex scenes, demonstrating significant improvements in success rates across simulated and real-world environments.
Contribution
The paper presents a novel object-centric image representation and a training framework using language models for improved zero-shot object navigation and retrieval.
Findings
Improves text-to-image recall by 1.38-13.38% across different benchmark settings for the retrieval task.
Improves navigation success rate by 5% in simulation and 16.67% in the real world.
Introduces LLM-based augmentation and prompt templates for stability during training and zero-shot inference.
Abstract
In this paper, we present LOC-ZSON, a novel Language-driven Object-Centric image representation for the object navigation task within complex scenes. We propose an object-centric image representation and corresponding losses for visual-language model (VLM) fine-tuning, which can handle complex object-level queries. In addition, we design a novel LLM-based augmentation and prompt templates for stability during training and zero-shot inference. We implement our method on the Astro robot and deploy it in both simulated and real-world environments for zero-shot object navigation. We show that our proposed method can achieve an improvement of 1.38 - 13.38% in terms of text-to-image recall on different benchmark settings for the retrieval task. For object navigation, we show the benefit of our approach in simulation and the real world, showing 5% and 16.67% improvement in terms of navigation success…
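For readers unfamiliar with the retrieval metric mentioned above, the following is a minimal sketch of how text-to-image recall at rank k is commonly computed. It is not the paper's evaluation code. The function names (cosine, recall_at_k), the use of cosine similarity, the single ground-truth image per query, and the toy embeddings are all assumptions made for illustration; the summary above does not state which k or which benchmark protocol the authors use.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recall_at_k(text_embs, image_embs, gt_index, k):
    """Fraction of text queries whose ground-truth image is among the top-k retrieved.

    Assumption: each query has exactly one correct image, given by gt_index.
    """
    if not text_embs:
        raise ValueError("at least one text query is required")
    hits = 0
    for query, target in zip(text_embs, gt_index):
        scores = [cosine(query, img) for img in image_embs]
        # Rank image indices from most to least similar to the query.
        ranked = sorted(range(len(image_embs)), key=lambda i: scores[i], reverse=True)
        if target in ranked[:k]:
            hits += 1
    return hits / len(text_embs)


# Hypothetical 2-D embeddings, for illustration only.
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
texts = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.0]]
ground_truth = [0, 1, 2]

r1 = recall_at_k(texts, images, ground_truth, k=1)
r2 = recall_at_k(texts, images, ground_truth, k=2)
```

In this toy setup the third query ranks its target image second, so it counts as a miss at k=1 and a hit at k=2. A reported gain in recall corresponds to more queries placing their target image inside the top k.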
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
