Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation
Hongcheng Wang, Yuxuan Wang, Fangwei Zhong, Mingdong Wu, Jianwei Zhang, Yizhou Wang, Hao Dong

TL;DR
This paper introduces a brain-inspired method for visual-audio navigation that learns semantic-agnostic and spatial-aware representations, improving generalization to unseen sounds and environments in robotic navigation tasks.
Contribution
The authors propose a novel auxiliary-task-based approach to learn representations that generalize across unseen sounds and environments, addressing limitations of previous methods.
Findings
Improved zero-shot generalization to unseen scenes and sounds
Achieved better performance on Replica and Matterport3D datasets
Demonstrated robustness in realistic 3D navigation scenarios
Abstract
Visual-audio navigation (VAN) is attracting growing attention from the robotics community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to the sound source using egocentric visual and audio observations. However, existing methods are limited in two respects: 1) poor generalization to unheard sound categories; 2) sample inefficiency in training. Focusing on these two problems, we propose a brain-inspired plug-and-play method to learn a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We meticulously design two auxiliary tasks, one for each of these desired characteristics, to accelerate learning of the representation. With these two auxiliary tasks, the agent learns a spatially-correlated representation of visual and audio inputs that can be…
Taxonomy
Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation
