DiffSED: Sound Event Detection with Denoising Diffusion

Swapnil Bhosale; Sauradip Nag; Diptesh Kanojia; Jiankang Deng; Xiatian Zhu

arXiv:2308.07293·cs.SD·August 21, 2023

TL;DR

DiffSED introduces a generative diffusion-based approach for sound event detection, transforming the problem into a denoising process that improves boundary prediction accuracy and training efficiency.

Contribution

This work pioneers a generative modeling framework for SED using denoising diffusion, contrasting with traditional discriminative methods, and demonstrates superior performance and faster training.

Findings

1. Outperforms existing methods on the Urban-SED and EPIC-Sounds datasets.

2. Achieves over 40% faster convergence during training.

3. Significantly improves boundary detection accuracy.

Abstract

Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the split-and-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions in the elegant Transformer decoder framework. Doing so enables the model to generate accurate event boundaries even from noisy queries during inference. Extensive…
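To make the "noisy proposals" idea concrete, the following is a minimal sketch of the standard forward (noising) step used in denoising diffusion models, applied here directly to normalized (onset, offset) pairs. It is not the authors' implementation: the cosine schedule, the function names `cosine_alpha_bar` and `noise_boundaries`, the number of steps, and the example events are all assumptions chosen for illustration, and the paper's latent-query parameterization and Transformer decoder are not reproduced.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal fraction alpha_bar[t] for t = 0..num_steps.

    Assumption: a cosine noise schedule; the abstract does not specify one.
    """
    t = np.arange(num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def noise_boundaries(x0: np.ndarray, t: int, alpha_bar: np.ndarray,
                     rng: np.random.Generator) -> np.ndarray:
    """Forward step: x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps."""
    eps = rng.standard_normal(x0.shape)  # Gaussian noise, same shape as x0
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)

# Two hypothetical events as (onset, offset), normalized to the clip length.
x0 = np.array([[0.10, 0.35],
               [0.50, 0.90]])

x_early = noise_boundaries(x0, 10, alpha_bar, rng)   # lightly perturbed boundaries
x_late = noise_boundaries(x0, 999, alpha_bar, rng)   # close to pure noise
```

Under these assumptions, a model trained to map `x_late`-style inputs back toward `x0`, conditioned on audio features, would correspond to the denoising direction the abstract describes.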

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection