Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network
Sharath Adavanne, Tuomas Virtanen

TL;DR
This paper introduces a neural network architecture that learns to detect the start and end times of sound events using only weak labels, by combining convolutional and recurrent layers with a novel training scheme.
Contribution
It presents a stacked convolutional and recurrent neural network with a dual prediction layer approach for weakly supervised sound event detection.
Findings
Error rate of 0.84 for strong labels
F-score of 43.3% for weak labels on test data
A training scheme that controls what the network learns from weak and strong labels by weighting their losses differently
Abstract
This paper proposes a neural network architecture and training scheme to learn the start and end times of sound events (strong labels) in an audio recording, given just the list of sound events present in the audio without time information (weak labels). We achieve this by using a stacked convolutional and recurrent neural network with two prediction layers in sequence, one for the strong labels followed by one for the weak labels. The network is trained using frame-wise log mel-band energy as the input audio feature, and the weak labels provided in the dataset as labels for the weak label prediction layer. Strong labels are generated by replicating the weak labels once for every frame of the input audio feature, and are used for the strong label layer during training. We propose to control what the network learns from the weak and strong labels by different weighting for the loss computed in the…
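The label replication and loss weighting described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the weight values `w_strong` and `w_weak`, the use of binary cross-entropy, and the max-over-time pooling used to produce a stand-in weak prediction are all assumptions made for illustration.

```python
import numpy as np


def replicate_weak_labels(weak, n_frames):
    """Repeat a clip-level label vector (n_classes,) once per frame.

    Returns an array of shape (n_frames, n_classes) that serves as the
    pseudo-strong target during training.
    """
    return np.tile(np.asarray(weak, dtype=float), (n_frames, 1))


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy (assumed loss; not stated in the abstract)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(y_pred)
                          + (1.0 - y_true) * np.log(1.0 - y_pred)))


def weighted_loss(weak, strong_pred, weak_pred, w_strong=1.0, w_weak=1.0):
    """Combine the strong-layer and weak-layer losses with separate weights.

    w_strong and w_weak are hypothetical names and values for the weighting
    the abstract refers to.
    """
    weak = np.asarray(weak, dtype=float)
    pseudo_strong = replicate_weak_labels(weak, strong_pred.shape[0])
    loss_strong = binary_cross_entropy(pseudo_strong, strong_pred)
    loss_weak = binary_cross_entropy(weak, weak_pred)
    return w_strong * loss_strong + w_weak * loss_weak


# Toy clip: 5 frames, 3 event classes, classes 0 and 2 present somewhere.
weak = np.array([1, 0, 1])
rng = np.random.default_rng(0)
strong_pred = rng.uniform(0.05, 0.95, size=(5, 3))  # stand-in for network output
weak_pred = strong_pred.max(axis=0)                 # assumed pooling, for illustration only
total = weighted_loss(weak, strong_pred, weak_pred, w_strong=0.5, w_weak=1.0)
```

Lowering `w_strong` relative to `w_weak` in this sketch reduces the influence of the replicated frame-level targets, which are noisy because an event rarely spans every frame of a clip.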
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
