Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection
Pengfei Cai, Yan Song, Nan Jiang, Qing Gu, Ian McLoughlin

TL;DR
This paper presents a self-supervised learning approach for sound event detection based on prototype-based masked audio modeling. It exploits unlabeled data by training a Transformer-based masked audio model on frame-level pseudo labels.
Contribution
Introduction of the Prototype based Masked Audio Model (PMAM) that uses Gaussian mixture model-derived pseudo labels for self-supervised learning in SED, outperforming existing methods.
Findings
Achieved a PSDS1 score of 62.5% on DESED, surpassing the previous state of the art.
Effective use of pseudo labels improves unlabeled data exploitation.
Fine-tuning with minimal labeled data yields high performance.
Abstract
A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes,…
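The two ideas in the abstract, GMM-derived frame-level pseudo labels and a binary cross-entropy loss with one independent term per prototype, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the diagonal-covariance GMM, and the posterior threshold used to turn responsibilities into multi-hot labels are all assumptions made for illustration.

```python
import math


def gmm_posteriors(frame, means, variances, weights):
    """Posterior of each diagonal-Gaussian component (prototype) for one frame."""
    log_probs = []
    for mu, var, w in zip(means, variances, weights):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        log_probs.append(ll)
    # Normalize in the log domain for numerical stability.
    peak = max(log_probs)
    exps = [math.exp(lp - peak) for lp in log_probs]
    total = sum(exps)
    return [e / total for e in exps]


def multi_hot(posteriors, threshold=0.3):
    """Hypothetical labeling rule: mark every prototype above a threshold."""
    return [1.0 if p >= threshold else 0.0 for p in posteriors]


def bce_per_prototype(logits, targets):
    """Binary cross-entropy computed separately for each prototype logit."""
    losses = []
    for z, t in zip(logits, targets):
        # Stable form of -[t*log(sigmoid(z)) + (1-t)*log(1-sigmoid(z))].
        losses.append(max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z))))
    return losses


# Toy example: two prototypes in a 2-D feature space.
means = [[0.0, 0.0], [5.0, 5.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
weights = [0.5, 0.5]

frame = [0.1, -0.2]
labels = multi_hot(gmm_posteriors(frame, means, variances, weights))
losses = bce_per_prototype([1.0, -2.0], labels)
print(labels, losses)
```

Because each prototype gets its own sigmoid term, changing the logit of one prototype leaves the loss terms of the others untouched. A softmax-normalized objective such as InfoNCE couples all candidates through its shared denominator.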
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing
Methods: InfoNCE
