SP-SEDT: Self-supervised Pre-training for Sound Event Detection Transformer

Zhirong Ye; Xiangdong Wang; Hong Liu; Yueliang Qian; Rui Tao; Long Yan; Kazushige Ouchi

arXiv:2111.15222·cs.SD·April 7, 2022

TL;DR

This paper introduces SP-SEDT, a self-supervised pre-training method for the Sound Event Detection Transformer (SEDT) that improves its localization ability without requiring large amounts of temporally annotated data.

Contribution

It proposes a self-supervised pre-training approach for SEDT based on detecting random patches cropped along the time axis, which reduces reliance on large temporally annotated datasets and avoids the domain gap between synthetic data and real recordings.

Findings

1. SP-SEDT can outperform a fine-tuned frame-based model on the DCASE2019 task4 dataset.

2. Self-supervised pre-training improves the detection performance of SEDT.

3. An ablation study examines the impact of different loss functions and patch sizes.

Abstract

Recently, an event-based end-to-end model (SEDT) has been proposed for sound event detection (SED) and achieves competitive performance. However, compared with frame-based models, it requires more training data with temporal annotations to improve its localization ability. Synthetic data is an alternative, but it suffers from a large domain gap with real recordings. Inspired by the success of UP-DETR in object detection, we propose to pre-train SEDT in a self-supervised manner (SP-SEDT) by detecting random patches (cropped only along the time axis). Experiments on the DCASE2019 task4 dataset show that the proposed SP-SEDT can outperform a fine-tuned frame-based model. An ablation study is also conducted to investigate the impact of different loss functions and patch sizes.
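
The pretext task mentioned in the abstract, detecting random patches cropped along the time axis, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `crop_random_time_patches`, the `min_frac` and `max_frac` length bounds, and the normalized (center, length) target encoding are all assumptions chosen for illustration. The spectrogram is represented as a plain list of time frames so the example needs only the standard library.

```python
import random
from typing import List, Optional, Sequence, Tuple

Frame = Sequence[float]  # one time frame of a spectrogram (values per frequency bin)


def crop_random_time_patches(
    spec: Sequence[Frame],
    num_patches: int = 3,
    min_frac: float = 0.1,   # assumed lower bound on patch length (fraction of clip)
    max_frac: float = 0.3,   # assumed upper bound on patch length (fraction of clip)
    rng: Optional[random.Random] = None,
) -> Tuple[List[Sequence[Frame]], List[Tuple[float, float]]]:
    """Crop random patches along the time axis only.

    Returns the patches and, for each patch, a hypothetical regression
    target (center, length), both normalized by the clip length.
    """
    rng = rng or random.Random()
    n_frames = len(spec)
    min_len = max(1, int(min_frac * n_frames))
    max_len = max(min_len, int(max_frac * n_frames))

    patches: List[Sequence[Frame]] = []
    targets: List[Tuple[float, float]] = []
    for _ in range(num_patches):
        length = rng.randint(min_len, max_len)        # inclusive on both ends
        onset = rng.randint(0, n_frames - length)     # patch stays inside the clip
        patches.append(spec[onset:onset + length])    # all frequency bins are kept
        targets.append(((onset + length / 2) / n_frames, length / n_frames))
    return patches, targets


if __name__ == "__main__":
    # Dummy clip: 500 frames, 64 frequency bins per frame.
    clip = [[0.0] * 64 for _ in range(500)]
    patches, targets = crop_random_time_patches(clip, rng=random.Random(0))
    for patch, (center, length) in zip(patches, targets):
        print(len(patch), round(center, 3), round(length, 3))
```

In a pre-training setup of this kind, the patches would be given to the model as queries and the (center, length) pairs would serve as localization targets, so no human temporal annotation is needed.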

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Speech Recognition and Synthesis