Multi-Iteration Multi-Stage Fine-Tuning of Transformers for Sound Event Detection with Heterogeneous Datasets

Florian Schmid; Paul Primus; Tobias Morocutti; Jonathan Greif; Gerhard Widmer

arXiv:2407.12997·eess.AS·July 19, 2024

TL;DR

This paper introduces a multi-iteration, multi-stage fine-tuning method for transformers in sound event detection, leveraging heterogeneous datasets and pseudo-labeling to achieve state-of-the-art results in the DCASE 2024 challenge.

Contribution

It presents a novel iterative training procedure combining self-supervised learning, pseudo-labeling, and distillation for improved sound event detection with transformers.

Findings

1. Achieved a PSDS1 of 0.692 on DESED, setting a new state-of-the-art.
2. Ranked first in Task 4 of the DCASE 2024 Challenge.
3. Demonstrated effectiveness of multi-stage fine-tuning with pseudo-labels.

Abstract

A central problem in building effective sound event detection systems is the lack of high-quality, strongly annotated sound event datasets. For this reason, Task 4 of the DCASE 2024 challenge proposes learning from two heterogeneous datasets, including audio clips labeled with varying annotation granularity and with different sets of possible events. We propose a multi-iteration, multi-stage procedure for fine-tuning Audio Spectrogram Transformers on the joint DESED and MAESTRO Real datasets. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the pre-trained transformer model frozen. In the second stage, both CRNN and transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of fine-tuned transformers. Then, in a…
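The abstract says that strong pseudo-labels are computed from an ensemble of fine-tuned transformers, but it does not say how the ensemble members are combined. The sketch below assumes the simplest option, averaging frame-level class probabilities across models. The function name `ensemble_pseudo_labels`, the nested-list layout, and the toy values are all hypothetical and are not taken from the paper or its repository.

```python
from typing import List

# Assumed layout: probs[frame][event_class] is a probability in [0, 1].
FrameProbs = List[List[float]]


def ensemble_pseudo_labels(model_outputs: List[FrameProbs]) -> FrameProbs:
    """Average frame-level class probabilities over ensemble members.

    Averaging is an assumption for illustration; the source only states
    that an ensemble of fine-tuned transformers produces the pseudo-labels.
    """
    if not model_outputs:
        raise ValueError("need at least one model output")
    n_models = len(model_outputs)
    n_frames = len(model_outputs[0])
    n_classes = len(model_outputs[0][0])
    # Every ensemble member must predict on the same frame/class grid.
    for out in model_outputs:
        if len(out) != n_frames or any(len(f) != n_classes for f in out):
            raise ValueError("all model outputs must share the same shape")
    return [
        [sum(out[t][c] for out in model_outputs) / n_models
         for c in range(n_classes)]
        for t in range(n_frames)
    ]


# Toy example: two hypothetical models, 3 frames, 2 event classes.
model_a = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7]]
model_b = [[0.7, 0.3], [0.6, 0.0], [0.3, 0.9]]
pseudo = ensemble_pseudo_labels([model_a, model_b])
```

The result keeps one probability per frame and class, which is what makes the labels "strong" (time-resolved) rather than clip-level. How these pseudo-labels are used afterwards is not covered by the truncated abstract shown here.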

Code & Models

Repositories

cpjku/cpjku_dcase24 (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis

Methods: Sparse Evolutionary Training