Improving Audio Spectrogram Transformers for Sound Event Detection Through Multi-Stage Training

Florian Schmid; Paul Primus; Tobias Morocutti; Jonathan Greif; Gerhard Widmer

arXiv:2408.00791 · eess.AS · August 5, 2024

TL;DR

This paper enhances sound event detection by multi-stage training of large Audio Spectrogram Transformers, leveraging pseudo-labels and pre-training on strongly labeled AudioSet data.

Contribution

It introduces a multi-stage training approach with pseudo-labeling and pre-training, significantly improving transformer-based sound event detection performance.

Findings

01 Performance boosted by pseudo-labeling and iterative training.

02 Pre-training on strongly labeled AudioSet improves accuracy.

03 Ensemble of transformers yields robust pseudo-labels.

Abstract

This technical report describes the CP-JKU team's submission for Task 4, "Sound Event Detection with Heterogeneous Training Datasets and Potentially Missing Labels", of the DCASE 24 Challenge. We fine-tune three large Audio Spectrogram Transformers, PaSST, BEATs, and ATST, on the joint DESED and MAESTRO datasets in a two-stage training procedure. The first stage closely matches the baseline system setup and trains a CRNN model while keeping the large pre-trained transformer model frozen. In the second stage, both the CRNN and the transformer are fine-tuned using heavily weighted self-supervised losses. After the second stage, we compute strong pseudo-labels for all audio clips in the training set using an ensemble of all three fine-tuned transformers. Then, in a second iteration, we repeat the two-stage training process and include a distillation loss based on the pseudo-labels, boosting…
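The procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the ensemble is combined or which distillation loss is used, so plain averaging of frame-level probabilities and a binary cross-entropy against the soft pseudo-labels are assumptions here. All function names (`ensemble_pseudo_labels`, `distillation_bce`, `trainable_parts`) are hypothetical and chosen only for illustration.

```python
import math


def ensemble_pseudo_labels(model_probs):
    """Average frame-level class probabilities over several models.

    model_probs: one [frames][classes] nested list per model.
    Averaging is an assumption; the abstract only says "ensemble".
    """
    n_models = len(model_probs)
    n_frames = len(model_probs[0])
    n_classes = len(model_probs[0][0])
    return [
        [sum(m[t][c] for m in model_probs) / n_models for c in range(n_classes)]
        for t in range(n_frames)
    ]


def distillation_bce(student_probs, pseudo_labels, eps=1e-7):
    """Mean binary cross-entropy between student outputs and soft pseudo-labels.

    BCE is an assumed choice of distillation loss, used here for illustration.
    """
    total, count = 0.0, 0
    for s_row, p_row in zip(student_probs, pseudo_labels):
        for s, p in zip(s_row, p_row):
            s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(p * math.log(s) + (1.0 - p) * math.log(1.0 - s))
            count += 1
    return total / count


def trainable_parts(stage):
    """Modules that receive gradient updates in each stage, per the abstract."""
    if stage == 1:
        return {"crnn"}  # transformer stays frozen
    if stage == 2:
        return {"crnn", "transformer"}  # both are fine-tuned
    raise ValueError("stage must be 1 or 2")


# Toy frame-level outputs (2 frames x 2 classes) standing in for the three
# fine-tuned transformers; the numbers are made up for the example.
passt = [[0.9, 0.1], [0.2, 0.8]]
beats = [[0.8, 0.2], [0.3, 0.7]]
atst = [[0.7, 0.3], [0.1, 0.9]]

pseudo = ensemble_pseudo_labels([passt, beats, atst])
uninformed_student = [[0.5, 0.5], [0.5, 0.5]]
loss_far = distillation_bce(uninformed_student, pseudo)
loss_near = distillation_bce(pseudo, pseudo)
```

In the second iteration, a loss of this kind would be added to the objectives already used in stages 1 and 2, so that a student whose frame-level outputs move toward the ensemble's pseudo-labels (`loss_near`) is penalized less than one that ignores them (`loss_far`).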

Code & Models

Repositories

cpjku/cpjku_dcase24 (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing

Methods: Sparse Evolutionary Training