TL;DR
This paper introduces SALSA, a novel feature combining log-spectrograms with spatial cues, improving polyphonic sound event localization and detection by effectively resolving overlapping sources across different microphone formats.
Contribution
The paper proposes SALSA, a new feature that integrates time-frequency and spatial information, enabling better joint optimization of sound event detection and localization tasks.
Findings
SALSA outperforms state-of-the-art features on the TAU-NIGENS dataset.
Using SALSA increases F1 score and localization recall significantly.
SALSA is applicable to different microphone array formats, such as first-order ambisonics (FOA) and multichannel microphone arrays (MIC).
Abstract
Sound event localization and detection (SELD) consists of two subtasks, which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. As a result, it is often difficult to jointly optimize these two subtasks. We propose a novel feature called Spatial cue-Augmented Log-SpectrogrAm (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-spectrograms stacked along with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the microphone…
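The feature construction described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `salsa_like_features`, the frame-averaging window `n_avg`, the normalization of the eigenvector by its first-channel entry, and the use of the phase of the remaining entries as the spatial cue are all assumptions made for illustration. The abstract only states that log-spectrograms are stacked with the normalized principal eigenvector of the spatial covariance matrix at each time-frequency bin.

```python
import numpy as np


def salsa_like_features(stft, n_avg=3, eps=1e-8):
    """Stack log-spectrograms with eigenvector-based spatial cues.

    stft: complex array of shape (n_channels, n_frames, n_bins).
    Returns a real array of shape (2 * n_channels - 1, n_frames, n_bins).
    All design choices here are illustrative assumptions.
    """
    n_ch, n_frames, n_bins = stft.shape

    # Multichannel log-power spectrograms, one per microphone channel.
    log_spec = np.log(np.abs(stft) ** 2 + eps)

    # Per-bin outer products x x^H, shape (n_frames, n_bins, n_ch, n_ch).
    x = np.moveaxis(stft, 0, -1)
    outer = x[..., :, None] * np.conj(x[..., None, :])

    # Spatial covariance estimate: average over neighbouring frames
    # (assumed window; the source does not specify the estimator).
    cov = np.zeros_like(outer)
    half = n_avg // 2
    for t in range(n_frames):
        lo, hi = max(0, t - half), min(n_frames, t + half + 1)
        cov[t] = outer[lo:hi].mean(axis=0)

    # eigh sorts eigenvalues in ascending order, so the last column
    # is the principal eigenvector of each Hermitian matrix.
    _, vecs = np.linalg.eigh(cov)
    v = vecs[..., :, -1]  # (n_frames, n_bins, n_ch)

    # Eigenvectors have an arbitrary complex scale; fix it by dividing
    # by the first-channel entry (assumed normalization).
    ref = v[..., :1]
    ref = np.where(np.abs(ref) < eps, eps, ref)
    v = v / ref

    # Spatial cue: phase of the non-reference entries (assumption).
    cues = np.moveaxis(np.angle(v[..., 1:]), -1, 0)

    # Each spatial cue shares the time-frequency grid of the
    # log-spectrograms, so power and direction stay aligned per bin.
    return np.concatenate([log_spec, cues], axis=0)
```

For a four-channel STFT of shape `(4, n_frames, n_bins)`, this returns seven feature maps: four log-spectrograms and three spatial-cue maps, all on the same time-frequency grid. That per-bin alignment between signal power and directional cue is the property the abstract highlights as crucial for resolving overlapping sources.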
