Learning Semantic-Agnostic and Spatial-Aware Representation for Generalizable Visual-Audio Navigation

Hongcheng Wang; Yuxuan Wang; Fangwei Zhong; Mingdong Wu; Jianwei Zhang; Yizhou Wang; Hao Dong

arXiv:2304.10773 · cs.RO · June 22, 2023 · 1 citation

TL;DR

This paper introduces a brain-inspired method for visual-audio navigation that learns semantic-agnostic and spatial-aware representations, improving generalization to unseen sounds and environments in robotic navigation tasks.

Contribution

The authors propose an auxiliary-task-based approach for learning representations that generalize to unseen sounds and environments, addressing two limitations of previous methods: poor generalization to unheard sound categories and sample inefficiency in training.

Findings

01. Improved zero-shot generalization to unseen scenes and sounds

02. Achieved better performance on Replica and Matterport3D datasets

03. Demonstrated robustness in realistic 3D navigation scenarios

Abstract

Visual-audio navigation (VAN) is attracting increasing attention from the robotics community due to its broad applications, e.g., household robots and rescue robots. In this task, an embodied agent must search for and navigate to the sound source using egocentric visual and audio observations. However, existing methods are limited in two respects: 1) poor generalization to unheard sound categories; 2) sample inefficiency in training. Focusing on these two problems, we propose a brain-inspired plug-and-play method to learn a semantic-agnostic and spatial-aware representation for generalizable visual-audio navigation. We meticulously design two auxiliary tasks, one for each of these desired characteristics, to accelerate representation learning. With these two auxiliary tasks, the agent learns a spatially-correlated representation of visual and audio inputs that can be…

Code & Models

Repositories

wwwwwyyyyyxxxxx/sa2gvan
PyTorch · Official

Taxonomy

Topics: Speech and Audio Processing · Music and Audio Processing · Hearing Loss and Rehabilitation