A Robust framework for sound event localization and detection on real recordings

Jin Sob Kim; Hyun Joon Park; Wooseok Shin; Sung Won Han

arXiv:2512.22156·cs.SD·December 30, 2025

TL;DR

This paper presents a robust ResNet-based framework for sound event localization and detection that leverages data augmentation, dataset mixing, and ensemble techniques to improve performance on real-world recordings.
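The abstract says the framework mixes real-world recordings with emulated (synthetic) scenes while keeping the number of real-recording samples in each batch fixed. The batch builder below is a minimal sketch of that idea under stated assumptions. The helper `mixed_batches` and its parameters `n_real` and `batch_size` are hypothetical names. Letting the real set drive the epoch length and cycling a reshuffled synthetic pool are also assumptions, not the authors' pipeline.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mixed_batches(
    real: Sequence[T],
    synthetic: Sequence[T],
    batch_size: int,
    n_real: int,
    seed: int = 0,
) -> List[List[T]]:
    """Hypothetical helper: build one epoch of batches, each holding exactly
    n_real real recordings and batch_size - n_real synthetic samples."""
    if not 0 < n_real <= batch_size:
        raise ValueError("n_real must be in (0, batch_size]")
    n_syn = batch_size - n_real
    if n_syn > 0 and not synthetic:
        raise ValueError("synthetic pool is empty")
    rng = random.Random(seed)
    real_idx = list(range(len(real)))
    rng.shuffle(real_idx)
    syn_pool: List[int] = []
    batches: List[List[T]] = []
    # Assumption: the real set defines the epoch; a trailing group smaller
    # than n_real is dropped so every batch keeps the same real-sample count.
    for start in range(0, len(real_idx) - n_real + 1, n_real):
        batch = [real[i] for i in real_idx[start:start + n_real]]
        for _ in range(n_syn):
            if not syn_pool:  # reshuffle and cycle the synthetic pool
                syn_pool = list(range(len(synthetic)))
                rng.shuffle(syn_pool)
            batch.append(synthetic[syn_pool.pop()])
        rng.shuffle(batch)  # avoid a fixed real/synthetic position pattern
        batches.append(batch)
    return batches
```

For example, 10 real clips with `batch_size=8` and `n_real=2` yield 5 batches. Each batch holds 2 real and 6 synthetic samples, so the real-world share stays constant however large the synthetic pool is.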

Contribution

The authors introduce a comprehensive framework combining augmentation, dataset mixing, and ensemble methods to enhance SELD performance on real-world data.
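The report mentions a clustering-based model ensemble, but this summary does not describe how it works. The sketch below is purely illustrative and assumes that each model contributes Cartesian DOA vectors for one class in one frame. The hypothetical `cluster_ensemble` groups estimates greedily when they fall within an angular threshold, discards clusters with too few votes, and averages the rest. The threshold `max_angle_deg` and vote count `min_votes` are assumed parameters, not values from the paper.

```python
import numpy as np


def cluster_ensemble(doas: np.ndarray, max_angle_deg: float = 20.0,
                     min_votes: int = 2) -> np.ndarray:
    """Hypothetical merge of DOA estimates pooled from several models.

    doas: shape (N, 3), Cartesian DOA vectors for one class in one frame.
    Returns shape (K, 3): one unit vector per cluster with >= min_votes members.
    """
    if len(doas) == 0:
        return np.zeros((0, 3))
    unit = doas / np.linalg.norm(doas, axis=1, keepdims=True)
    cos_thr = np.cos(np.deg2rad(max_angle_deg))
    unassigned = list(range(len(unit)))
    merged = []
    while unassigned:
        seed = unassigned[0]
        # Greedy grouping: remaining estimates within max_angle_deg of the seed
        # (the seed itself always qualifies, so the loop terminates).
        members = [i for i in unassigned if unit[i] @ unit[seed] >= cos_thr]
        unassigned = [i for i in unassigned if i not in members]
        if len(members) >= min_votes:
            mean = unit[members].mean(axis=0)
            merged.append(mean / np.linalg.norm(mean))
    return np.array(merged).reshape(-1, 3)
```

Requiring a minimum number of votes suppresses a DOA that only one model predicts. Averaging inside a cluster smooths the models that agree.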

Findings

01. Outperforms baseline methods on real-world sound recordings
02. Achieves competitive SELD performance
03. Makes effective use of augmentation and ensemble techniques
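The summary does not name the metric behind "competitive SELD performance." For context, the DCASE2021/2022 Task 3 evaluations, as we understand them, rank systems by an aggregate SELD error. This error averages four terms: the location-aware error rate (ER at 20°), one minus the location-aware F-score (F at 20°), the class-dependent localization error in degrees divided by 180, and one minus the localization recall. Lower values are better. The function name is ours.

```python
def seld_score(er: float, f: float, le_deg: float, lr: float) -> float:
    """Aggregate SELD error (lower is better), as used in DCASE Task 3.

    er:     location-aware error rate (ER_20deg)
    f:      location-aware F-score (F_20deg), in [0, 1]
    le_deg: class-dependent localization error, in degrees
    lr:     class-dependent localization recall, in [0, 1]
    """
    return (er + (1.0 - f) + le_deg / 180.0 + (1.0 - lr)) / 4.0
```

A perfect system (ER = 0, F = 1, LE = 0°, LR = 1) scores 0. Dividing by 180 puts the angular error on roughly the same 0 to 1 scale as the other three terms.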

Abstract

This technical report describes the systems submitted to DCASE2022 Challenge Task 3: sound event localization and detection (SELD). The task aims to detect occurrences of sound events, identify their classes, and estimate their positions. Our system uses a ResNet-based model within a proposed robust framework for SELD. To ensure generalized performance on real-world sound scenes, we design the overall framework around augmentation techniques, a pipeline that mixes datasets of real-world sound scenes and emulations, and test time augmentation. The augmentation techniques and the use of external sound sources enable training on diverse samples. Maintaining the number of real-recording samples in each batch preserves enough opportunity to learn the real-world context. In addition, we design a test time augmentation and a clustering-based model ensemble method…
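The abstract is truncated before it explains which transforms the test time augmentation (TTA) uses. The sketch below therefore swaps in an illustrative choice: rotations of first-order Ambisonics (FOA) input about the z-axis. The model predicts on each rotated copy, the inverse rotation is applied to the predicted DOA vectors, and the results are averaged. The ACN channel order (W, Y, Z, X), float input of shape (4, T), Cartesian DOA outputs, and all function names are assumptions.

```python
from typing import Callable

import numpy as np

# Assumed ACN channel order for the FOA input: W, Y, Z, X.
W, Y, Z, X = 0, 1, 2, 3


def rotate_foa_z(foa: np.ndarray, phi: float) -> np.ndarray:
    """Rotate a float FOA signal of shape (4, T) by phi radians about z."""
    out = foa.copy()
    c, s = np.cos(phi), np.sin(phi)
    out[X] = c * foa[X] - s * foa[Y]  # x' = x cos(phi) - y sin(phi)
    out[Y] = s * foa[X] + c * foa[Y]  # y' = x sin(phi) + y cos(phi)
    return out  # W and Z are unchanged by a z-rotation


def rotate_doa_z(doa: np.ndarray, phi: float) -> np.ndarray:
    """Rotate Cartesian DOA vectors of shape (..., 3) by phi radians about z."""
    c, s = np.cos(phi), np.sin(phi)
    x, y, z = doa[..., 0], doa[..., 1], doa[..., 2]
    return np.stack([c * x - s * y, s * x + c * y, z], axis=-1)


def tta_predict(model: Callable[[np.ndarray], np.ndarray],
                foa: np.ndarray, n_rot: int = 4) -> np.ndarray:
    """Average DOA predictions over n_rot evenly spaced z-rotations."""
    preds = []
    for k in range(n_rot):
        phi = 2 * np.pi * k / n_rot
        pred = model(rotate_foa_z(foa, phi))
        preds.append(rotate_doa_z(pred, -phi))  # map back to the original frame
    return np.mean(preds, axis=0)
```

A perfectly rotation-equivariant model gains nothing from this averaging. A trained network is only approximately equivariant, so the averaged estimate smooths out orientation-dependent errors.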

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis