Active Audio-Visual Separation of Dynamic Sound Sources
Sagnik Majumder, Kristen Grauman

TL;DR
This paper introduces an active audio-visual separation method in which an embodied agent, trained with reinforcement learning and equipped with a transformer memory, moves through a complex 3D environment to continuously isolate a time-varying target sound from a mixture, outperforming baseline methods on continuous separation.
Contribution
It presents a novel reinforcement learning approach with transformer memory for active audio-visual separation of dynamic sound sources in realistic 3D environments.
Findings
The model learns motion policies that control its camera and microphone for sound separation.
It achieves high-quality separation of dynamic audio sources in simulated environments.
The approach outperforms baseline methods in continuous separation tasks.
Abstract
We explore active audio-visual separation for dynamic sound sources, where an embodied agent moves intelligently in a 3D environment to continuously isolate the time-varying audio stream being emitted by an object of interest. The agent hears a mixed stream of multiple audio sources (e.g., multiple people conversing and a band playing music at a noisy party). Given a limited time budget, it needs to extract the target sound accurately at every step using egocentric audio-visual observations. We propose a reinforcement learning agent equipped with a novel transformer memory that learns motion policies to control its camera and microphone to recover the dynamic target audio, using self-attention to make high-quality estimates for current timesteps and also simultaneously improve its past estimates. Using highly realistic acoustic SoundSpaces simulations in real-world scanned Matterport3D…
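The abstract's idea of using self-attention over a memory to refine both current and past estimates can be illustrated with a toy sketch. The code below is not the authors' architecture: it assumes a memory of hypothetical per-step feature vectors, uses the raw features as queries, keys, and values (no learned projections), and applies plain scaled dot-product attention with no causal mask, so an earlier step can draw on observations made later.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention_refine(memory):
    """Refine every stored per-step feature by attending over the whole memory.

    memory: list of T feature vectors (lists of floats), one per timestep.
    Returns (refined, weights): refined[t] is an attention-weighted average of
    all memory entries, and weights[t] is the attention distribution of step t.
    Assumption: queries, keys, and values are the raw features themselves.
    """
    dim = len(memory[0])
    scale = math.sqrt(dim)
    refined, weights = [], []
    for query in memory:
        # Scaled dot-product score between this step and every stored step,
        # including later ones (no causal mask).
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory]
        attn = softmax(scores)
        weights.append(attn)
        refined.append(
            [sum(a * value[d] for a, value in zip(attn, memory)) for d in range(dim)]
        )
    return refined, weights


# Hypothetical toy features for three timesteps (illustrative values only).
memory = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
refined, weights = self_attention_refine(memory)
```

Because the attention is not masked, the refined feature for step 0 places nonzero weight on steps 1 and 2, which is the sense in which a past estimate can be revised once later observations arrive.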
Taxonomy
Topics: Speech and Audio Processing · Music Technology and Sound Studies · Music and Audio Processing
