ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings
Arjun Majumdar, Gunjan Aggarwal, Bhavika Devnani, Judy Hoffman, Dhruv Batra

TL;DR
This paper introduces ZSON, a zero-shot object-goal navigation method. Agents are trained on image-goal navigation with goals encoded in a multimodal embedding space, so they can later find objects described in natural language in open-world environments without any ObjectNav rewards or demonstrations.
Contribution
The paper proposes a novel zero-shot approach for object-goal navigation using multimodal semantic embeddings trained on image-goal data, allowing natural language goal specification in open environments.
Findings
Achieves absolute improvements in success rate of 4.2% to 20.0% over existing zero-shot methods.
Enables agents to follow complex and compound natural language instructions.
Generalizes well across multiple datasets and real-world scenarios.
Abstract
We present a scalable approach for learning open-world object-goal navigation (ObjectNav) -- the task of asking a virtual robot (agent) to find any instance of an object in an unexplored environment (e.g., "find a sink"). Our approach is entirely zero-shot -- i.e., it does not require ObjectNav rewards or demonstrations of any kind. Instead, we train on the image-goal navigation (ImageNav) task, in which agents find the location where a picture (i.e., goal image) was captured. Specifically, we encode goal images into a multimodal, semantic embedding space to enable training semantic-goal navigation (SemanticNav) agents at scale in unannotated 3D environments (e.g., HM3D). After training, SemanticNav agents can be instructed to find objects described in free-form natural language (e.g., "sink", "bathroom sink", etc.) by projecting language goals into the same multimodal, semantic…
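The central idea in the abstract, training a goal-conditioned agent on image goals and then swapping in language goals that live in the same embedding space, can be illustrated with a toy sketch. Everything below is a hypothetical stand-in rather than the authors' implementation: the concept axes, `encode_goal_image`, `encode_goal_text`, and `goal_conditioned_score` are invented for illustration, and a real system would use a pretrained multimodal encoder and a learned navigation policy.

```python
import math

# Hypothetical concept axes of a shared image/text embedding space.
CONCEPTS = ("sink", "bed", "sofa")
EPS = 1e-6  # keeps vectors non-zero so normalization never divides by zero


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def encode_goal_image(concept_scores):
    """Toy image encoder (assumption): takes per-concept scores that a real
    vision encoder would infer from pixels and returns a unit vector."""
    return normalize([concept_scores.get(c, 0.0) + EPS for c in CONCEPTS])


def encode_goal_text(text):
    """Toy text encoder (assumption): counts concept words in the text and
    maps them onto the same axes used by the image encoder."""
    words = text.lower().split()
    return normalize([float(words.count(c)) + EPS for c in CONCEPTS])


def cosine(a, b):
    """Cosine similarity of two unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def goal_conditioned_score(observation_embedding, goal_embedding):
    """Stand-in for a policy input: the agent only sees the goal embedding,
    so it cannot tell whether the goal came from an image or from text."""
    return cosine(observation_embedding, goal_embedding)


# Training time: the goal is a picture (here, mostly showing a sink).
image_goal = encode_goal_image({"sink": 0.9, "bed": 0.1})

# Test time: the goal is free-form language projected into the same space.
text_goal = encode_goal_text("bathroom sink")
other_goal = encode_goal_text("bed")

matching = cosine(image_goal, text_goal)
mismatched = cosine(image_goal, other_goal)
print(f"sink image vs 'bathroom sink': {matching:.3f}")
print(f"sink image vs 'bed': {mismatched:.3f}")
```

Because the matching text goal lands close to the image goal it describes, a policy trained only on image-goal embeddings can, in principle, be driven by text-goal embeddings without retraining. That is the property this sketch is meant to show, under the assumptions stated above.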
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
