Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets
Florian Schmid, Paul Primus, Tobias Morocutti, Jonathan Greif, Gerhard Widmer

TL;DR
This paper introduces a multi-iteration, multi-stage fine-tuning method for transformers in sound event detection, leveraging heterogeneous datasets and pseudo-labeling to achieve state-of-the-art results in the DCASE 2024 challenge.
Contribution
It presents a novel iterative training procedure combining self-supervised learning, pseudo-labeling, and distillation for improved sound event detection with transformers.
Findings
Achieved a PSDS1 of 0.692 on DESED, setting a new state-of-the-art.
Ranked first in Task 4 of the DCASE 2024 Challenge.
Demonstrated effectiveness of multi-stage fine-tuning with pseudo-labels.
Abstract
A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a…
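The stages described in the abstract can be illustrated with a minimal sketch. It shows only which components are trainable in the first two stages (as stated above) and one plausible way to form ensemble pseudo-labels. The names `STAGES`, `trainable_components`, and `ensemble_pseudo_labels` are hypothetical, and plain averaging of frame-level probabilities is an assumption made for illustration; the abstract does not specify how the ensemble outputs are combined.

```python
from typing import Dict, List

# Hypothetical stage table: which components receive gradient updates.
# Stage 1 keeps the pre-trained transformer frozen and trains the CRNN;
# stage 2 fine-tunes both (per the abstract).
STAGES: Dict[str, Dict[str, bool]] = {
    "stage1": {"transformer": False, "crnn": True},
    "stage2": {"transformer": True, "crnn": True},
}


def trainable_components(stage: str) -> List[str]:
    """Return the names of the components that are trained in a stage."""
    return [name for name, trainable in STAGES[stage].items() if trainable]


# One model's output for a clip: probabilities indexed as [frame][class].
Frames = List[List[float]]


def ensemble_pseudo_labels(model_outputs: List[Frames]) -> Frames:
    """Average per-frame, per-class probabilities over several models.

    Assumption: simple averaging. The result is a soft, frame-level
    ("strong") pseudo-label for a single audio clip.
    """
    if not model_outputs:
        raise ValueError("need at least one model output")
    n_models = len(model_outputs)
    n_frames = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    return [
        [sum(m[t][c] for m in model_outputs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


if __name__ == "__main__":
    print(trainable_components("stage1"))
    # Two hypothetical models, one frame, two event classes.
    print(ensemble_pseudo_labels([[[0.25, 1.0]], [[0.75, 0.0]]]))
```

In practice the per-model outputs would come from the fine-tuned transformers run over every clip in the training set; the lists here are toy stand-ins.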
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis
Methods: Sparse Evolutionary Training
