Cross-Referencing Self-Training Network for Sound Event Detection in Audio Mixtures
Sangwook Park, David K. Han, Mounya Elhilali

TL;DR
This paper introduces a semi-supervised self-training network with cross-referencing for sound event detection, reducing the need for extensive labeled data and improving detection accuracy in audio mixtures.
Contribution
The paper proposes a novel student-teacher semi-supervised framework that combines cross-training with post-processing to improve sound event detection performance.
Findings
Significant improvement over state-of-the-art semi-supervised methods.
Effective pseudo-label generation from unlabeled data.
Enhanced detection accuracy on DCASE2020 challenge dataset.
Abstract
Sound event detection is an important facet of audio tagging that aims to identify sounds of interest and define both the sound category and time boundaries for each sound event in a continuous recording. With advances in deep neural networks, there has been tremendous improvement in the performance of sound event detection systems, although at the expense of costly data collection and labeling efforts. In fact, current state-of-the-art methods employ supervised training methods that leverage large amounts of data samples and corresponding labels in order to facilitate identification of sound category and time stamps of events. As an alternative, the current study proposes a semi-supervised method for generating pseudo-labels from unsupervised data using a student-teacher scheme that balances self-training and cross-training. Additionally, this paper explores post-processing which…
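The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' network or training procedure: it only shows how frame-wise class probabilities from a teacher model could be thresholded into binary pseudo-labels and then converted into event time boundaries. The function names, the 0.5 threshold, and the hop size are assumptions chosen for illustration.

```python
import numpy as np


def pseudo_labels(teacher_probs, threshold=0.5):
    """Threshold frame-wise teacher probabilities into binary pseudo-labels.

    teacher_probs: array-like of shape (frames, classes), values in [0, 1].
    threshold: hypothetical confidence cut-off (not taken from the paper).
    """
    probs = np.asarray(teacher_probs, dtype=float)
    return (probs >= threshold).astype(np.int64)


def events_from_frames(frame_labels, hop_seconds):
    """Convert one class's binary frame labels into (onset, offset) pairs in seconds."""
    # Pad with zeros so events touching the clip edges still produce a rise and a fall.
    padded = np.concatenate(([0], np.asarray(frame_labels), [0]))
    changes = np.diff(padded)
    onsets = np.flatnonzero(changes == 1)    # first active frame of each event
    offsets = np.flatnonzero(changes == -1)  # first inactive frame after each event
    return [(float(on * hop_seconds), float(off * hop_seconds))
            for on, off in zip(onsets, offsets)]


# Toy teacher output: 4 frames, 2 sound classes (made-up numbers).
probs = [[0.1, 0.9],
         [0.7, 0.8],
         [0.6, 0.2],
         [0.2, 0.1]]
labels = pseudo_labels(probs)
class0_events = events_from_frames(labels[:, 0], hop_seconds=0.5)
class1_events = events_from_frames(labels[:, 1], hop_seconds=0.5)
```

In a student-teacher setup, labels like these would serve as training targets for the student on unlabeled clips. How the paper balances self-training against cross-training is not specified in the excerpt above, so it is not modeled here.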
Taxonomy
Topics: Music and Audio Processing · Diverse Musicological Studies · Speech and Audio Processing
