TL;DR
This paper introduces a time-frequency (T-F) segmentation framework for sound event detection that is trained on weakly labelled data, enabling event recognition, detection, and separation without onset and offset annotations.
Contribution
The paper proposes a CNN-based T-F segmentation approach trained on weak labels. The approach performs sound event detection (SED) and separation together, and reports higher F1 scores than the baselines in audio tagging and SED.
Findings
Achieved higher F1 scores in audio tagging and SED compared to baselines.
Reached an F1 score of 0.218 on T-F segmentation, a task that was reportedly not possible with previous methods.
Produced separated sound waveforms from weakly labelled data.
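Since the findings are reported as F1 scores, a minimal sketch of how an F1 score can be computed between a predicted and a reference binary T-F mask may help in reading the numbers. The function name `f1_score`, the nested-list mask layout, and the toy masks are illustrative assumptions; the summary does not state how the paper thresholds masks or defines its reference masks.

```python
def f1_score(predicted, reference):
    """F1 between two binary masks given as equally sized nested lists.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    tp = fp = fn = 0
    for pred_row, ref_row in zip(predicted, reference):
        for p, r in zip(pred_row, ref_row):
            if p and r:
                tp += 1      # bin active in both masks
            elif p and not r:
                fp += 1      # bin active only in the prediction
            elif r and not p:
                fn += 1      # bin active only in the reference
    if tp == 0:
        return 0.0           # avoids division by zero when nothing matches
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy 2x3 masks (rows are frequency bins, columns are time frames).
predicted = [[1, 1, 0], [0, 1, 0]]
reference = [[1, 0, 0], [0, 1, 1]]
score = f1_score(predicted, reference)
print(round(score, 3))
```

In this toy case there are two true positives, one false positive, and one false negative, so precision and recall are both 2/3 and the F1 score is also 2/3.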
Abstract
Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without knowing their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied on a T-F representation, such as the log mel spectrogram of an audio clip, to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied…
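The separation step mentioned in the abstract, using a T-F mask to pull an event out of the background, can be sketched as an element-wise product between a mask and a spectrogram. This is only an illustration under stated assumptions: the function name `apply_mask`, the use of a linear magnitude spectrogram, and the toy values are not taken from the paper, and the segmentation and classification mappings themselves are not reproduced here.

```python
def apply_mask(spectrogram, mask):
    """Element-wise product of a magnitude spectrogram and a T-F mask.

    Both arguments are nested lists of the same shape
    (rows are frequency bins, columns are time frames).
    Hypothetical helper for illustration only.
    """
    return [
        [s * m for s, m in zip(spec_row, mask_row)]
        for spec_row, mask_row in zip(spectrogram, mask)
    ]


# Toy 2x2 magnitude spectrogram (assumed values).
spectrogram = [[2.0, 4.0], [1.0, 3.0]]

# Toy mask for one sound event; values in [0, 1] give the assumed share
# of each T-F bin that belongs to the event.
event_mask = [[1.0, 0.25], [0.0, 0.5]]

# Assumption: the background takes whatever the event mask leaves over.
background_mask = [[1.0 - m for m in row] for row in event_mask]

event = apply_mask(spectrogram, event_mask)
background = apply_mask(spectrogram, background_mask)
print(event)
print(background)
```

With complementary masks, the event and background estimates add back up to the original spectrogram in every bin. Recovering a waveform would additionally need phase information and an inverse transform, which this sketch leaves out.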
