TL;DR
This paper introduces a convolutional recurrent neural network that jointly localizes and detects overlapping sound events in 3D space, demonstrating robustness across various array formats and real-world conditions.
Contribution
The paper presents a novel CNN-RNN model that performs sound event detection and 3D localization simultaneously. Because it requires no array-specific feature extraction, it can be applied to diverse array geometries.
Findings
Higher recall of estimated DOAs than the baselines.
Robust performance in reverberant and low SNR environments.
Effective in scenarios with multiple overlapping sound events.
Abstract
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, the sound event detection (SED) is performed as a multi-label classification task on each time-frame producing temporal activity for all the sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction-of-arrival (DOA) for each sound event class using multi-output regression. The proposed method is able to associate multiple DOAs with respective sound event labels and further track this association with respect to time. The proposed method uses separately the phase and magnitude…
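The pairing of the two outputs described in the abstract can be illustrated with a small sketch. It is not the authors' code. It assumes that, for a single time-frame, the SED output is a list of per-class activity probabilities and the DOA output is a list of per-class (x, y, z) estimates. The function name `decode_seld_frame`, the 0.5 threshold, and the azimuth/elevation convention are all hypothetical choices made for illustration.

```python
import math


def decode_seld_frame(sed_probs, doa_xyz, threshold=0.5):
    """Pair each active sound event class with its DOA for one time-frame.

    sed_probs: per-class activity probabilities (the SED output).
    doa_xyz:   per-class (x, y, z) estimates (the DOA regression output).
    threshold: activity threshold; an assumption, not taken from the paper.
    Returns a list of (class_index, azimuth_deg, elevation_deg) tuples.
    """
    events = []
    for class_idx, (prob, (x, y, z)) in enumerate(zip(sed_probs, doa_xyz)):
        if prob < threshold:
            continue  # class judged inactive, so its DOA estimate is ignored
        if x == 0.0 and y == 0.0 and z == 0.0:
            continue  # a zero vector carries no direction
        # Assumed convention: azimuth in the x-y plane, elevation from that plane.
        azimuth = math.degrees(math.atan2(y, x))
        elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
        events.append((class_idx, azimuth, elevation))
    return events


# Three classes: classes 0 and 2 are active, class 1 is not.
sed = [0.9, 0.2, 0.7]
doa = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
events = decode_seld_frame(sed, doa)
```

Because each class has its own DOA slot, a detected label and its direction stay associated frame by frame, which is the property the abstract refers to when it mentions tracking the association over time.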
