Prototype based Masked Audio Model for Self-Supervised Learning of Sound Event Detection

Pengfei Cai; Yan Song; Nan Jiang; Qing Gu; Ian McLoughlin

arXiv:2409.17656 · cs.SD · September 27, 2024

TL;DR

This paper presents a self-supervised learning approach for sound event detection based on prototype-based masked audio modeling. Frame-level pseudo labels derived from a Gaussian mixture model supervise a Transformer-based masked audio model, allowing unlabeled data to be exploited more effectively and improving detection performance.

Contribution

Introduces the Prototype based Masked Audio Model (PMAM), which uses pseudo labels derived from a Gaussian mixture model (GMM) for self-supervised learning in SED and outperforms existing methods.

Findings

01 Achieved a PSDS1 score of 62.5% on DESED, surpassing the previous state of the art.

02 Effective use of pseudo labels improves exploitation of unlabeled data.

03 Fine-tuning with minimal labeled data yields high performance.

Abstract

A significant challenge in sound event detection (SED) is the effective utilization of unlabeled data, given the limited availability of labeled data due to high annotation costs. Semi-supervised algorithms rely on labeled data to learn from unlabeled data, and the performance is constrained by the quality and size of the former. In this paper, we introduce the Prototype based Masked Audio Model (PMAM) algorithm for self-supervised representation learning in SED, to better exploit unlabeled data. Specifically, semantically rich frame-level pseudo labels are constructed from a Gaussian mixture model (GMM) based prototypical distribution modeling. These pseudo labels supervise the learning of a Transformer-based masked audio model, in which binary cross-entropy loss is employed instead of the widely used InfoNCE loss, to provide independent loss contributions from different prototypes,…
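
As a rough illustration of the idea described in the abstract, the sketch below (plain NumPy, not the authors' implementation) turns frame features into soft pseudo labels using the posterior responsibilities of a diagonal-covariance GMM, then scores per-prototype logits on masked frames with binary cross-entropy. The function names `gmm_posteriors` and `masked_bce`, the use of posteriors as targets, the diagonal covariances, and the random stand-in for Transformer outputs are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def gmm_posteriors(frames, means, variances, weights):
    """Posterior responsibility of each of K diagonal Gaussians for each frame.

    frames: (T, D); means, variances: (K, D); weights: (K,).
    Returns an array of shape (T, K) whose rows sum to 1.
    """
    diff = frames[:, None, :] - means[None, :, :]  # (T, K, D)
    log_prob = -0.5 * np.sum(
        diff ** 2 / variances + np.log(2.0 * np.pi * variances), axis=-1
    )  # (T, K)
    log_joint = log_prob + np.log(weights)
    log_joint -= log_joint.max(axis=1, keepdims=True)  # numerical stability
    post = np.exp(log_joint)
    return post / post.sum(axis=1, keepdims=True)


def masked_bce(logits, targets, mask):
    """Mean binary cross-entropy over masked frames only.

    Each prototype has its own sigmoid, so its loss term is independent
    of the logits assigned to the other prototypes.
    """
    # Numerically stable form of BCE with logits.
    per_elem = (
        np.maximum(logits, 0.0)
        - logits * targets
        + np.log1p(np.exp(-np.abs(logits)))
    )
    return per_elem[mask].mean()


# Toy example with hypothetical sizes: 8 frames, 4 features, 3 prototypes.
rng = np.random.default_rng(0)
T, D, K = 8, 4, 3
frames = rng.normal(size=(T, D))
means = rng.normal(size=(K, D))
variances = np.ones((K, D))
weights = np.full(K, 1.0 / K)

targets = gmm_posteriors(frames, means, variances, weights)  # soft pseudo labels
mask = np.zeros(T, dtype=bool)
mask[[1, 4, 5]] = True  # frames hidden from the model (assumed masking pattern)
logits = rng.normal(size=(T, K))  # stand-in for Transformer outputs
loss = masked_bce(logits, targets, mask)
print(float(loss))
```

Because every prototype is scored through its own sigmoid, a frame's loss for one prototype does not depend on the logits of the others, which is the property the abstract contrasts with the softmax-style InfoNCE objective.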

Code & Models

Repositories

cai525/transformer4sed (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Speech and Audio Processing

Methods: InfoNCE