A Robust Framework for Sound Event Localization and Detection on Real Recordings
Jin Sob Kim, Hyun Joon Park, Wooseok Shin, Sung Won Han

TL;DR
This paper presents a robust ResNet-based framework for sound event localization and detection that leverages data augmentation, dataset mixing, and ensemble techniques to improve performance on real-world recordings.
Contribution
The authors introduce a comprehensive framework combining augmentation, dataset mixing, and ensemble methods to enhance SELD performance on real-world data.
Findings
Outperforms baseline methods on real-world sound recordings
Achieves competitive SELD performance
Uses augmentation and ensemble techniques effectively
Abstract
This technical report describes the systems submitted to DCASE2022 challenge task 3: sound event localization and detection (SELD). The task aims to detect occurrences of sound events, specify their class, and estimate their position. Our system uses a ResNet-based model within a proposed robust framework for SELD. To ensure generalized performance on real-world sound scenes, we design the overall framework with augmentation techniques, a pipeline for mixing datasets from real-world sound scenes and emulations, and test time augmentation. Augmentation techniques and the use of external sound sources enable training on diverse samples, while maintaining the number of real recording samples in each batch keeps enough opportunity to learn the real-world context. In addition, we design a test time augmentation and a clustering-based model ensemble method…
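The abstract's idea of maintaining the number of real recording samples in each batch can be illustrated with a small sketch. This is not the authors' implementation: the function name `mixed_batches`, its parameters, and the choice to sample real items with replacement are all assumptions made for illustration, and the abstract does not state the actual real-to-emulated ratio.

```python
import random


def mixed_batches(real_ids, synth_ids, batch_size, n_real, n_batches, seed=0):
    """Yield batches that each contain exactly n_real real-recording items.

    Hypothetical helper: real_ids and synth_ids are lists of sample
    identifiers for real recordings and emulated (synthetic) scenes.
    """
    if not 0 < n_real <= batch_size:
        raise ValueError("n_real must be in the range 1..batch_size")
    n_synth = batch_size - n_real
    if n_synth > len(synth_ids):
        raise ValueError("not enough emulated samples for one batch")

    rng = random.Random(seed)
    for _ in range(n_batches):
        # Assumption: the real pool is small, so draw with replacement.
        real = rng.choices(real_ids, k=n_real)
        # Emulated samples are drawn without replacement within a batch.
        synth = rng.sample(synth_ids, k=n_synth)
        batch = [("real", i) for i in real] + [("synth", i) for i in synth]
        rng.shuffle(batch)  # avoid a fixed real/synth ordering in the batch
        yield batch


if __name__ == "__main__":
    for batch in mixed_batches(list(range(5)), list(range(100)),
                               batch_size=8, n_real=2, n_batches=3):
        print(batch)
```

Fixing the count per batch, rather than sampling from the pooled dataset, means that adding more emulated or externally sourced data does not dilute how often the model sees real recordings.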
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
