Fine-tune the pretrained ATST model for sound event detection

Nian Shao; Xian Li; Xiaofei Li

arXiv:2309.08153·eess.AS·January 1, 2024

TL;DR

This paper explores fine-tuning a large pretrained self-supervised audio model, ATST-Frame, for sound event detection, achieving state-of-the-art results by effectively adapting the model with both labeled and unlabeled data.

Contribution

The paper introduces a fine-tuning method for the ATST-Frame model in sound event detection (SED) that overcomes overfitting when adapting the large pretrained network and sets new state-of-the-art results.

Findings

01. Achieved new SOTA PSDS1/PSDS2 scores of 0.587/0.812 on the DCASE challenge task 4 dataset.

02. Proposed a fine-tuning approach that uses both labeled and unlabeled data.

03. Demonstrated effective adaptation of a large pretrained model to SED.
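
To make the idea of training on both labeled and unlabeled data concrete, here is a minimal sketch of one common semi-supervised recipe: a supervised loss on labeled clips plus a consistency loss that pulls a student model's predictions on unlabeled clips toward those of a slowly updated teacher. This summary does not state which scheme the paper uses, so the recipe itself, the function names (`semi_supervised_loss`, `ema_update`), and the parameters `consistency_weight` and `decay` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def bce(pred, target):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    eps = 1e-7  # clamp to avoid log(0)
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(pred)


def mse(a, b):
    """Mean squared error between two equal-length prediction lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def semi_supervised_loss(student_labeled, targets,
                         student_unlabeled, teacher_unlabeled,
                         consistency_weight=1.0):
    """Hypothetical combined objective (an assumption, not the paper's loss).

    Labeled data contributes a supervised BCE term; unlabeled data
    contributes a student/teacher consistency term.
    """
    supervised = bce(student_labeled, targets)
    consistency = mse(student_unlabeled, teacher_unlabeled)
    return supervised + consistency_weight * consistency


def ema_update(teacher_params, student_params, decay=0.999):
    """Move teacher parameters toward the student by exponential averaging."""
    return [decay * t + (1.0 - decay) * s
            for t, s in zip(teacher_params, student_params)]


# Toy usage: two labeled frame predictions, two unlabeled frame predictions.
loss = semi_supervised_loss([0.9, 0.1], [1, 0], [0.5, 0.5], [0.5, 0.5])
teacher = ema_update([1.0], [0.0], decay=0.9)
```

In this toy call the student and teacher agree on the unlabeled frames, so only the supervised term contributes to `loss`. In a real system the lists would be replaced by frame-level model outputs and the parameter lists by network weights.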

Abstract

Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages the large pretrained self-supervised learning (SelfSL) models to mitigate such restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained models are regarded as a frozen feature extractor in the challenge baseline system and most of the challenge submissions, and fine-tuning of the pretrained models has been rarely studied. In this work, we study the fine-tuning method of the pretrained models for SED. We first introduce ATST-Frame, our newly proposed SelfSL model, to the SED system. ATST-Frame was especially designed for learning frame-level representations of audio signals and obtained state-of-the-art (SOTA) performances on a series of downstream tasks. We then propose a…

Code & Models

Repositories

Audio-WestlakeU/ATST-SED
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis