Dynamical Audio-Visual Navigation: Catching Unheard Moving Sound Sources in Unmapped 3D Environments
Abdelrahman Younes

TL;DR
This paper introduces a dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped 3D environment amid distractors and noisy sounds, and proposes an end-to-end multi-modal reinforcement learning approach for the task.
Contribution
The paper presents a novel benchmark and a multi-modal reinforcement learning method that fuses binaural audio with spatial occupancy maps, improving generalization to unheard sounds and robustness to noise when navigating to moving sound sources.
Findings
Outperforms the current state of the art on the new dynamic audio-visual navigation benchmark
Generalizes better to unheard sounds
Is more robust in noisy scenarios, evaluated on the Replica and Matterport3D datasets
Abstract
Recent work on audio-visual navigation targets a single static sound in noise-free audio environments and struggles to generalize to unheard sounds. We introduce the novel dynamic audio-visual navigation benchmark in which an embodied AI agent must catch a moving sound source in an unmapped environment in the presence of distractors and noisy sounds. We propose an end-to-end reinforcement learning approach that relies on a multi-modal architecture that fuses the spatial audio-visual information from a binaural audio signal and spatial occupancy maps to encode the features needed to learn a robust navigation policy for our new complex task settings. We demonstrate that our approach outperforms the current state-of-the-art with better generalization to unheard sounds and better robustness to noisy scenarios on the two challenging 3D scanned real-world datasets Replica and Matterport3D,…
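The fusion idea in the abstract can be illustrated with a minimal sketch. This is not the authors' architecture: the `FusionPolicy` class, the layer sizes, the random weights, and the four-action output are all hypothetical, chosen only to show how features from a binaural signal and an occupancy map might be encoded separately, concatenated, and mapped to action probabilities. A real system would use learned convolutional encoders and a trained reinforcement learning policy.

```python
import math
import random


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(x):
    m = max(x)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in x]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(rng, n_in, n_out):
    """Random (untrained) weights; a stand-in for learned parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class FusionPolicy:
    """Hypothetical late-fusion policy: encode each modality, concatenate, score actions."""

    def __init__(self, audio_dim, map_dim, hidden=8, n_actions=4, seed=0):
        rng = random.Random(seed)
        self.audio_enc = init_layer(rng, audio_dim, hidden)
        self.map_enc = init_layer(rng, map_dim, hidden)
        self.head = init_layer(rng, 2 * hidden, n_actions)

    def action_probs(self, binaural, occupancy):
        # Flatten the two audio channels (left, right) into one feature vector.
        audio = [v for channel in binaural for v in channel]
        # Flatten the 2D occupancy grid (1 = occupied, 0 = free).
        grid = [float(cell) for row in occupancy for cell in row]
        audio_feat = relu(linear(audio, *self.audio_enc))
        map_feat = relu(linear(grid, *self.map_enc))
        fused = audio_feat + map_feat  # list concatenation, not element-wise sum
        return softmax(linear(fused, *self.head))


# Toy inputs: 6 audio features per ear and a 3x3 occupancy grid (assumed shapes).
policy = FusionPolicy(audio_dim=2 * 6, map_dim=9)
binaural = [[0.1] * 6, [0.4] * 6]
occupancy = [[0, 1, 0], [0, 0, 0], [1, 0, 0]]
probs = policy.action_probs(binaural, occupancy)
```

Keeping the two encoders separate until the concatenation step means each modality can be preprocessed at its own resolution, which is one common motivation for late fusion; whether the paper fuses at this stage is an assumption of the sketch.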
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
