Pseudo Strong Labels from Frame-Level Predictions for Weakly Supervised Sound Event Detection

Yuliang Zhang; Defeng (David) Huang; Roberto Togneri

arXiv:2501.03740 · eess.AS · January 8, 2025

TL;DR

This paper proposes Frame-level Pseudo Strong Labeling (FPSL), a method that generates pseudo strong labels from the frame-level predictions of a model trained on weakly labeled data. The pseudo labels improve temporal localization in sound event detection, with significant gains on three benchmark datasets.

Contribution

The study introduces FPSL, a new approach to generate pseudo strong labels from frame-level predictions, enhancing weakly supervised sound event detection performance.

Findings

01 FPSL improves PSDS scores by up to 7.6% on benchmark datasets.

02 CRNNs trained with FPSL outperform baselines in event detection metrics.

03 The gains hold across DCASE2017 Task 4, DCASE2018 Task 4, and UrbanSED, in PSDS, event-based F1, and intersection-based F1.

Abstract

Weakly Supervised Sound Event Detection (WSSED), which relies on audio tags without precise onset and offset times, has become prevalent due to the scarcity of strongly labeled data that includes exact temporal boundaries for events. This study introduces Frame-level Pseudo Strong Labeling (FPSL) to overcome the lack of temporal information in WSSED by generating pseudo strong labels from frame-level predictions. This enhances temporal localization during training and addresses the limitations of clip-wise weak supervision. We validate our approach across three benchmark datasets (DCASE2017 Task 4, DCASE2018 Task 4, and UrbanSED) and demonstrate significant improvements in key metrics such as the Polyphonic Sound Detection Scores (PSDS), event-based F1 scores, and intersection-based F1 scores. For example, Convolutional Recurrent Neural Networks (CRNNs) trained with FPSL outperform…
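The abstract describes the core idea only at a high level, so the following is a minimal sketch of one way pseudo strong labels could be derived from frame-level predictions, not the authors' implementation. The function name `pseudo_strong_labels`, the fixed `threshold`, and the step that masks out classes missing from the clip-level tag are all assumptions made for illustration.

```python
import numpy as np

def pseudo_strong_labels(frame_probs, weak_labels, threshold=0.5):
    """Turn frame-level probabilities into frame-level pseudo labels.

    frame_probs: array of shape (T, C), per-frame class probabilities.
    weak_labels: array of shape (C,), clip-level tags in {0, 1}.
    threshold:   assumed hyperparameter, not taken from the paper.
    """
    frame_probs = np.asarray(frame_probs, dtype=float)
    weak_labels = np.asarray(weak_labels, dtype=float)
    # Binarize the frame-level predictions.
    active = (frame_probs >= threshold).astype(float)
    # Assumption: a class absent from the clip tag cannot be active in any frame.
    return active * weak_labels[None, :]

# Toy clip: 3 frames, 2 classes; only class 0 is tagged at clip level.
frame_probs = np.array([[0.9, 0.8],
                        [0.7, 0.2],
                        [0.1, 0.6]])
weak_labels = np.array([1, 0])
pseudo = pseudo_strong_labels(frame_probs, weak_labels)
```

In a training loop, a matrix like `pseudo` could serve as the frame-level target alongside the clip-level loss; how the paper actually combines the two is not stated in the text above.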

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing · Image and Signal Denoising Methods