TL;DR
This paper introduces a duration-robust CRNN framework for weakly-supervised sound event detection. Using novel post-processing and data augmentation techniques, it performs well without prior knowledge of event durations, especially on datasets with short events.
Contribution
The paper proposes a new CRNN-based model with a Triple Threshold post-processing strategy and data augmentation methods, improving localization performance in weakly-supervised SED without needing duration labels.
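The summary above names the Triple Threshold strategy but does not spell out its rule. As a rough illustration only, the sketch below assumes one plausible reading: a clip-level threshold first decides whether an event class is present at all, and a pair of high/low frame-level thresholds (hysteresis) then marks its intervals. The function names, the default threshold values, and the frame-index output format are all hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def hysteresis_intervals(
    probs: Sequence[float], hi: float, lo: float
) -> List[Tuple[int, int]]:
    """Return [start, end) frame intervals for one event class.

    A run of consecutive frames with probability >= lo is kept only
    if at least one frame in that run reaches hi.
    """
    intervals: List[Tuple[int, int]] = []
    start = None       # index where the current run began, if any
    has_peak = False   # whether the current run contains a frame >= hi
    for i, p in enumerate(probs):
        if p >= lo:
            if start is None:
                start, has_peak = i, False
            if p >= hi:
                has_peak = True
        else:
            if start is not None and has_peak:
                intervals.append((start, i))
            start = None
    # Close a run that extends to the end of the clip.
    if start is not None and has_peak:
        intervals.append((start, len(probs)))
    return intervals


def triple_threshold_sketch(
    frame_probs: Sequence[float],
    clip_prob: float,
    clip_thresh: float = 0.5,  # assumed value, for illustration only
    hi: float = 0.75,          # assumed value, for illustration only
    lo: float = 0.2,           # assumed value, for illustration only
) -> List[Tuple[int, int]]:
    """Gate on the clip-level score, then apply the double threshold."""
    if clip_prob < clip_thresh:
        return []  # class judged absent from the clip: emit no events
    return hysteresis_intervals(frame_probs, hi, lo)


frame_probs = [0.1, 0.3, 0.8, 0.4, 0.1, 0.3, 0.3, 0.1, 0.9]
events = triple_threshold_sketch(frame_probs, clip_prob=0.9)
print(events)
```

In this sketch no duration estimate enters the rule: interval boundaries follow the low threshold, and the high threshold only decides whether a run is kept. Converting frame indices to seconds would require the model's frame hop, which the summary does not state.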
Findings
Outperforms existing methods on DCASE2018 and URBAN-SED datasets.
Achieves similar performance to supervised models on URBAN-SED.
The proposed post-processing significantly reduces the drop in localization performance.
Abstract
Sound event detection (SED) is the task of tagging the absence or presence of audio events and their corresponding interval within a given audio clip. While SED can be done using supervised machine learning, where training data is fully labeled with access to per-event timestamps and durations, our work focuses on weakly-supervised sound event detection (WSSED), where prior knowledge about an event's duration is unavailable. Recent research within the field focuses on improving segment- and event-level localization performance for specific datasets and specific evaluation metrics. Specifically, well-performing event-level localization requires fully labeled development subsets to obtain event duration estimates, which significantly benefits localization performance. Moreover, well-performing segment-level localization models output predictions at a coarse scale (e.g., 1 second),…
