CoLoC: Conditioned Localizer and Classifier for Sound Event Localization and Detection

Sławomir Kapka; Jakub Tkaczuk

arXiv:2210.13932·cs.SD·October 26, 2022

Open Access

TL;DR

This paper introduces CoLoC, a two-stage neural network approach for sound event localization and detection that improves accuracy by conditioning classification on localization outputs and handling an unknown number of sources.

Contribution

The paper presents a novel two-stage SELD model with conditioned classification and source number estimation, which improves on the baseline system in most metrics on the STARSS22 dataset.

Findings

01 Improves on the baseline in most metrics on the STARSS22 dataset

02 Effective handling of an unknown number of sound sources

03 Two single-output models are suitable for the SELD task

Abstract

In this article, we describe the Conditioned Localizer and Classifier (CoLoC), a novel solution for Sound Event Localization and Detection (SELD). The solution consists of two stages: localization is performed first, followed by classification conditioned on the output of the localizer. To resolve the problem of the unknown number of sources, we incorporate an idea borrowed from Sequential Set Generation (SSG). The models in both stages are SELDnet-like CRNNs, but with single outputs. Our reasoning shows that two such single-output models are fit for the SELD task. We show that our solution improves on the baseline system in most metrics on the STARSS22 dataset.
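
The two-stage control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the real models are SELDnet-like CRNNs, whereas here the localizer and classifier are toy stand-ins. The sketch assumes one reading of the SSG idea, namely that the localizer is queried repeatedly, conditioned on the sources found so far, until it signals that none remain. All names (`sequential_localize`, `localize_then_classify`, the `None` stop signal, the `max_sources` cap, the toy scene) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
DOA = Tuple[float, float]   # (azimuth, elevation) in degrees
Scene = Dict[DOA, str]      # toy stand-in for the audio features of one frame

Localizer = Callable[[Scene, List[DOA]], Optional[DOA]]
Classifier = Callable[[Scene, DOA], str]


def sequential_localize(
    localizer: Localizer, features: Scene, max_sources: int = 4
) -> List[DOA]:
    """Collect one direction per call until the localizer returns None.

    Assumption: each call is conditioned on the directions found so far,
    so the number of sources does not need to be known in advance.
    """
    found: List[DOA] = []
    while len(found) < max_sources:
        doa = localizer(features, found)
        if doa is None:  # assumed stop signal: no further source
            break
        found.append(doa)
    return found


def localize_then_classify(
    localizer: Localizer,
    classifier: Classifier,
    features: Scene,
    max_sources: int = 4,
) -> List[Tuple[str, DOA]]:
    """Stage 1: localize. Stage 2: classify, conditioned on each location."""
    doas = sequential_localize(localizer, features, max_sources)
    return [(classifier(features, doa), doa) for doa in doas]


# Toy stand-ins for the two single-output models.
def toy_localizer(scene: Scene, found: List[DOA]) -> Optional[DOA]:
    # Return the first direction not yet reported, or None when all are found.
    for doa in scene:
        if doa not in found:
            return doa
    return None


def toy_classifier(scene: Scene, doa: DOA) -> str:
    # Look up the class of the event at the given direction.
    return scene[doa]


toy_scene: Scene = {(30.0, 0.0): "speech", (-90.0, 10.0): "door"}
events = localize_then_classify(toy_localizer, toy_classifier, toy_scene)
```

In this sketch each model produces a single output per call: one direction from the localizer and one class from the classifier. That mirrors the abstract's point about single-output models, under the assumptions stated above.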

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing