Conditioned Time-Dilated Convolutions for Sound Event Detection
Konstantinos Drossos, Stylianos I. Mimilakis, Tuomas Virtanen

TL;DR
This paper introduces conditioned time-dilated convolutions for sound event detection (SED). The time-dilated convolutions are conditioned on jointly learned embeddings of the classifier's own SED predictions, which raises the F1 score and lowers the error rate on the TUT-SED Synthetic dataset.
Contribution
The paper proposes a novel algorithm for conditioning time-dilated convolutions in SED, which improves detection performance over the unconditioned time-dilated convolutions it builds on.
Findings
Raised the F1 score by 0.02 absolute, i.e. 2 percentage points (0.63 to 0.65)
Reduced the error rate by 0.03 absolute, i.e. 3 percentage points (0.50 to 0.47)
Validated on the TUT-SED Synthetic dataset
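For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how a per-frame F1 score and error rate are commonly computed for polyphonic SED. It assumes the widely used definitions in which the error rate counts substitutions, deletions, and insertions per frame and divides by the number of active reference events; the summary does not spell out the paper's exact evaluation code, so the function name `frame_f1_and_error_rate` and the toy data are hypothetical.

```python
def frame_f1_and_error_rate(reference, predicted):
    """Compute micro-averaged per-frame F1 and error rate.

    reference, predicted: lists of frames, each frame a list of 0/1
    activity flags with one entry per sound event class.
    Assumed definitions (not taken from the paper's code):
      F1 = 2*TP / (2*TP + FP + FN)
      ER = (S + D + I) / N, with per-frame
           S = min(FN, FP), D = max(0, FN - FP), I = max(0, FP - FN)
    """
    tp = fp = fn = 0
    subs = dels = ins = 0
    n_ref = 0  # total number of active reference events over all frames
    for ref_frame, pred_frame in zip(reference, predicted):
        pairs = list(zip(ref_frame, pred_frame))
        f_tp = sum(1 for r, p in pairs if r and p)
        f_fp = sum(1 for r, p in pairs if p and not r)
        f_fn = sum(1 for r, p in pairs if r and not p)
        tp, fp, fn = tp + f_tp, fp + f_fp, fn + f_fn
        subs += min(f_fn, f_fp)        # a miss paired with a false alarm
        dels += max(0, f_fn - f_fp)    # remaining misses
        ins += max(0, f_fp - f_fn)     # remaining false alarms
        n_ref += sum(ref_frame)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    er = (subs + dels + ins) / n_ref if n_ref else 0.0
    return f1, er


# Toy example: 4 frames, 3 classes (made-up data, for illustration only).
reference = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
predicted = [[1, 0, 0], [1, 0, 1], [0, 0, 0], [0, 1, 0]]
f1, er = frame_f1_and_error_rate(reference, predicted)
print(f1, er)
```

Under these definitions a higher F1 and a lower error rate are both better, which is the direction of the changes listed above.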
Abstract
Sound event detection (SED) is the task of identifying sound events along with their onset and offset times. A recent SED method based on convolutional neural networks proposed the use of depthwise separable (DWS) and time-dilated convolutions. DWS and time-dilated convolutions yielded state-of-the-art results for SED with a considerably small number of parameters. In this work we propose an extension of the time-dilated convolutions, conditioning them on jointly learned embeddings of the SED predictions made by the SED classifier. We present a novel algorithm for conditioning the time-dilated convolutions, which functions similarly to language modelling and enhances the performance of these convolutions. We employ the freely available TUT-SED Synthetic dataset, and we assess the performance of our method using the average per-frame F1 score and average per-frame…
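As a rough illustration of what "time-dilated" means here, the sketch below applies a 1-D kernel along the time axis while skipping `dilation - 1` frames between taps, which widens the temporal context without adding parameters. It is a plain-Python toy under stated assumptions and shows only the dilation itself: the paper's conditioning algorithm and the DWS convolutions are not reproduced, and the function name `time_dilated_conv1d` is hypothetical.

```python
def time_dilated_conv1d(x, kernel, dilation=1):
    """'Valid' 1-D cross-correlation along time with a dilation factor.

    x: list of per-frame values (a single feature channel).
    kernel: list of filter taps.
    dilation: spacing, in frames, between consecutive taps.
    """
    # Number of frames covered by the dilated kernel, minus one.
    span = (len(kernel) - 1) * dilation
    out = []
    for t in range(len(x) - span):
        # Tap i reads the frame i * dilation steps after t.
        out.append(sum(k * x[t + i * dilation] for i, k in enumerate(kernel)))
    return out


frames = [1, 2, 3, 4, 5, 6]
print(time_dilated_conv1d(frames, [1, 0, -1], dilation=1))  # [-2, -2, -2, -2]
print(time_dilated_conv1d(frames, [1, 0, -1], dilation=2))  # [-4, -4]
```

With the same three taps, a dilation of 2 covers five frames instead of three, which is why dilation is attractive for modelling longer temporal context in SED.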
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
