Effective Pre-Training of Audio Transformers for Sound Event Detection

Florian Schmid; Tobias Morocutti; Francesco Foscarin; Jan Schlüter; Paul Primus; Gerhard Widmer

arXiv:2409.09546·eess.AS·December 2, 2024

TL;DR

This paper introduces a pre-training pipeline for audio spectrogram transformers that adds a training routine on AudioSet frame-level annotations, combining a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation, and that improves sound event detection performance over previously available checkpoints.

Contribution

The paper presents a pre-training routine for audio spectrogram transformers that extends common pre-training steps with training on AudioSet frame-level annotations, improving performance on downstream frame-level sound event detection tasks.

Findings

01

Substantial performance gains over previously available checkpoints on AudioSet frame-level predictions.

02

Pre-training pipeline validated on five transformer models.

03

Public release of high-performance checkpoints for sound event detection.

Abstract

We propose a pre-training pipeline for audio spectrogram transformers for frame-level sound event detection tasks. On top of common pre-training steps, we add a meticulously designed training routine on AudioSet frame-level annotations. This includes a balanced sampler, aggressive data augmentation, and ensemble knowledge distillation. For five transformers, we obtain a substantial performance improvement over previously available checkpoints both on AudioSet frame-level predictions and on frame-level sound event detection downstream tasks, confirming our pipeline's effectiveness. We publish the resulting checkpoints that researchers can directly fine-tune to build high-performance models for sound event detection tasks.
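
As a rough illustration of two ingredients named in the abstract, the balanced sampler and ensemble knowledge distillation, the following Python sketch shows one common way such components can be written. It is not the authors' implementation: the function names, the inverse-class-frequency weighting, the averaging of teacher probabilities, and the `kd_weight` parameter are all assumptions made for this example, since the abstract does not specify these details.

```python
import math
from collections import Counter


def balanced_sampling_weights(clip_labels):
    """Return one sampling weight per clip so that rare classes are drawn more often.

    Assumption: a clip's weight is the sum of the inverse frequencies of its
    labels. The weighting used in the paper is not stated in the abstract.
    """
    class_counts = Counter(label for labels in clip_labels for label in labels)
    return [sum(1.0 / class_counts[label] for label in labels) for labels in clip_labels]


def ensemble_soft_targets(teacher_probs):
    """Average frame-level probabilities over all teachers.

    teacher_probs is indexed as [teacher][frame][class].
    """
    n_teachers = len(teacher_probs)
    n_frames = len(teacher_probs[0])
    n_classes = len(teacher_probs[0][0])
    return [
        [sum(t[f][c] for t in teacher_probs) / n_teachers for c in range(n_classes)]
        for f in range(n_frames)
    ]


def binary_cross_entropy(pred, target, eps=1e-7):
    """Mean binary cross-entropy over all frames and classes."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
            count += 1
    return total / count


def distillation_loss(student_probs, hard_labels, teacher_probs, kd_weight=0.5):
    """Mix the loss on annotated labels with the loss on ensemble soft targets.

    kd_weight is a hypothetical mixing coefficient chosen for illustration.
    """
    soft_targets = ensemble_soft_targets(teacher_probs)
    label_loss = binary_cross_entropy(student_probs, hard_labels)
    kd_loss = binary_cross_entropy(student_probs, soft_targets)
    return (1.0 - kd_weight) * label_loss + kd_weight * kd_loss


if __name__ == "__main__":
    # Three toy clips: "dog" appears twice, "siren" once.
    print(balanced_sampling_weights([["dog"], ["dog"], ["siren"]]))
    # Two toy teachers, one frame, two classes.
    teachers = [[[0.2, 0.8]], [[0.4, 0.6]]]
    print(distillation_loss([[0.3, 0.7]], [[0.0, 1.0]], teachers))
```

In this sketch the per-clip weights could be passed to any weighted random sampler, and the student is pulled toward both the frame-level annotations and the averaged teacher predictions.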

Code & Models

Repositories

Datasets

deepvk/NonverbalTTS
dataset · 177 downloads

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis