DiffSED: Sound Event Detection with Denoising Diffusion
Swapnil Bhosale, Sauradip Nag, Diptesh Kanojia, Jiankang Deng, Xiatian, Zhu

TL;DR
DiffSED introduces a generative diffusion-based approach for sound event detection, transforming the problem into a denoising process that improves boundary prediction accuracy and training efficiency.
Contribution
This work pioneers a generative modeling framework for SED using denoising diffusion, contrasting with traditional discriminative methods, and demonstrates superior performance and faster training.
Findings
Outperforms existing methods on Urban-SED and EPIC-Sounds datasets.
Achieves over 40% faster convergence during training.
Significantly improves boundary detection accuracy.
Abstract
Sound Event Detection (SED) aims to predict the temporal boundaries of all the events of interest and their class labels, given an unconstrained audio sample. Taking either the splitand-classify (i.e., frame-level) strategy or the more principled event-level modeling approach, all existing methods consider the SED problem from the discriminative learning perspective. In this work, we reformulate the SED problem by taking a generative learning perspective. Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process, conditioned on a target audio sample. During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions in the elegant Transformer decoder framework. Doing so enables the model generate accurate event boundaries from even noisy queries during inference. Extensive…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
Taxonomy
TopicsMusic and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
MethodsMulti-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection
