A Two-Step Learning Framework for Enhancing Sound Event Localization and Detection

Hogeon Yu

arXiv:2507.22322 · cs.SD · July 31, 2025

TL;DR

This paper introduces a two-step learning framework for sound event localization and detection that improves spatial and event recognition by maintaining temporal consistency and preventing task interference.

Contribution

The proposed framework combines a track-wise reordering format with task-specific training and feature fusion, addressing limitations of existing single- and dual-branch SELD models.

Findings

01

Enhanced SELD performance on the DCASE 2023 Challenge Task 3 dataset

02

Better spatial and event classification accuracy

03

Overcomes optimization conflicts in existing models

Abstract

Sound Event Localization and Detection (SELD) is crucial in spatial audio processing, enabling systems to detect sound events and estimate their 3D directions. Existing SELD methods use single- or dual-branch architectures: single-branch models share one representation between sound event detection (SED) and direction-of-arrival (DoA) estimation, which causes optimization conflicts, while dual-branch models separate the tasks but limit information exchange. To address this, we propose a two-step learning framework. First, we introduce a track-wise reordering format to maintain temporal consistency, preventing event reassignments across tracks. Next, we train the SED and DoA networks separately to prevent interference and ensure task-specific feature learning. Finally, we fuse DoA and SED features to enhance SELD performance with better spatial and event representation. Experiments on the DCASE 2023 Challenge Task 3 dataset validate our framework, showing its ability…
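
The abstract does not spell out the reordering rule, so the following is only a minimal sketch of what a track-wise reordering could look like, not the paper's implementation. It assumes a fixed number of output tracks, identifies an event by its class id (so at most one active event per class in a frame), and uses a hypothetical `(class_id, azimuth_deg, elevation_deg)` label tuple. The idea illustrated is that an event keeps the track it held in the previous frame, so it is not reassigned when another event ends.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical label layout: (class_id, azimuth_deg, elevation_deg).
Event = Tuple[int, float, float]
Frame = List[Optional[Event]]


def reorder_tracks(frames: List[List[Event]], num_tracks: int) -> List[Frame]:
    """Place each frame's active events into fixed track slots.

    Assumption: an event is identified by its class id. A continuing event
    keeps its previous track; a new event takes the lowest free track.
    """
    prev_track_of: Dict[int, int] = {}  # class_id -> track in previous frame
    out: List[Frame] = []
    for events in frames:
        slots: Frame = [None] * num_tracks
        track_of: Dict[int, int] = {}
        pending: List[Event] = []
        # First pass: continuing events stay on the track they already hold.
        for ev in events:
            t = prev_track_of.get(ev[0])
            if t is not None and slots[t] is None:
                slots[t] = ev
                track_of[ev[0]] = t
            else:
                pending.append(ev)
        # Second pass: newly started events fill the lowest free tracks.
        for ev in pending:
            free = [i for i, s in enumerate(slots) if s is None]
            if not free:
                break  # more events than tracks; extras are dropped here
            slots[free[0]] = ev
            track_of[ev[0]] = free[0]
        prev_track_of = track_of
        out.append(slots)
    return out


# Toy example: class 0 ends after the first frame, class 2 starts in the third.
frames = [
    [(0, 10.0, 0.0), (1, 90.0, 0.0)],
    [(1, 92.0, 0.0)],
    [(1, 94.0, 0.0), (2, -30.0, 5.0)],
]
tracks = reorder_tracks(frames, num_tracks=3)
```

In this toy example, class 1 stays on track 1 in all three frames even after class 0 ends, and class 2 takes the freed track 0. A format that simply packed active events into the lowest tracks would instead move class 1 to track 0 in the second frame.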
