CoLoC: Conditioned Localizer and Classifier for Sound Event Localization and Detection
Sławomir Kapka, Jakub Tkaczuk

TL;DR
This paper introduces CoLoC, a two-stage neural network approach for sound event localization and detection that improves accuracy by conditioning classification on localization outputs and handling an unknown number of sources.
Contribution
The paper presents a novel two-stage SELD model with conditioned classification and source number estimation that improves on the baseline system in most metrics on the STARSS22 dataset.
Findings
Improves on the baseline system in most metrics on the STARSS22 dataset
Effective handling of unknown number of sound sources
Two single-output models are suitable for SELD tasks
Abstract
In this article, we describe the Conditioned Localizer and Classifier (CoLoC), a novel solution for Sound Event Localization and Detection (SELD). The solution consists of two stages: localization is performed first, followed by classification conditioned on the output of the localizer. To resolve the problem of the unknown number of sources, we incorporate an idea borrowed from Sequential Set Generation (SSG). Models from both stages are SELDnet-like CRNNs, but with single outputs. Our reasoning shows that two such single-output models are fit for the SELD task. We show that our solution improves on the baseline system in most metrics on the STARSS22 Dataset.
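The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration only, not the authors' implementation: the function names (`coloc_infer`, `toy_localizer`, `toy_classifier`), the `(azimuth, elevation)` direction format, the `None` stop signal, and the `max_sources` cap are all assumptions made for the example, and the toy stubs stand in for the SELDnet-like CRNNs.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical direction format: (azimuth, elevation) in degrees.
Direction = Tuple[float, float]
Frame = Dict[Direction, str]  # toy "features": direction -> true class label


def coloc_infer(
    features: Frame,
    localizer: Callable[[Frame, List[Direction]], Optional[Direction]],
    classifier: Callable[[Frame, Direction], str],
    max_sources: int = 5,  # assumed safety cap, not taken from the paper
) -> List[Tuple[Direction, str]]:
    """Localize sources one at a time, then classify each one."""
    found: List[Direction] = []
    # Stage 1 (SSG-style): the single-output localizer is queried repeatedly,
    # conditioned on the sources found so far, until it signals "no more".
    while len(found) < max_sources:
        doa = localizer(features, found)
        if doa is None:  # assumed stop signal
            break
        found.append(doa)
    # Stage 2: the single-output classifier is conditioned on each direction.
    return [(doa, classifier(features, doa)) for doa in found]


def toy_localizer(features: Frame, found: List[Direction]) -> Optional[Direction]:
    """Stand-in for the localizer: return the next direction not yet found."""
    for doa in features:
        if doa not in found:
            return doa
    return None


def toy_classifier(features: Frame, doa: Direction) -> str:
    """Stand-in for the classifier: look up the label at the given direction."""
    return features[doa]


toy_frame: Frame = {(30.0, 0.0): "speech", (-90.0, 10.0): "door"}
events = coloc_infer(toy_frame, toy_localizer, toy_classifier)
```

Because each stage emits a single output per call, the number of detected events is decided by how many times the localizer runs before stopping, rather than by a fixed number of output tracks.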
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
