An empirical study of weakly supervised audio tagging embeddings for general audio representations

Heinrich Dinkel; Zhiyong Yan; Yongqing Wang; Junbo Zhang; Yujun Wang

arXiv:2209.15167·cs.SD·October 3, 2022

TL;DR

This study evaluates the effectiveness of pre-trained weakly supervised audio tagging models as feature extractors for diverse audio tasks, comparing them with self-supervised methods across fourteen benchmarks.

Contribution

It provides a comprehensive benchmark showing that weakly supervised audio tagging models are effective for transfer learning in various audio classification tasks.

Findings

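01. AT pre-trained models are an excellent transfer learning choice for music, event, and emotion recognition tasks.

02. Finetuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.

03. Weakly supervised AT models are a competitive alternative to self-supervised methods such as BYOL-A when used as feature extractors.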

Abstract

We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning methods (BYOL-A) as feature extractors. Fourteen downstream tasks are used for evaluation ranging from music instrument classification to language classification. Our results indicate that AT pre-trained models are an excellent transfer learning choice for music, event, and emotion recognition tasks. Further, finetuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.
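
The feature-extractor setting described in the abstract can be illustrated with a minimal sketch: a pre-trained encoder is kept frozen, each clip is mapped to a fixed-size embedding, and only a small classifier is trained on top for the downstream task. Everything in the block below is an assumption made for illustration. `frozen_encoder` is a hypothetical stand-in (a fixed random projection with ReLU and mean pooling over time), not MobileNetV2, EfficientNet-B0, or BYOL-A; the data is synthetic; and the softmax classifier is one plausible downstream head, not necessarily the one used in the paper.

```python
import numpy as np


def frozen_encoder(clips, weights):
    """Hypothetical stand-in for a pre-trained AT model (not the paper's network).

    clips: (batch, frames, n_mels) -> embeddings: (batch, emb_dim).
    Projects each frame, applies ReLU, then mean-pools over time.
    """
    return np.maximum(clips @ weights, 0.0).mean(axis=1)


def make_clips(rng, n, frames=20, n_mels=8):
    """Synthetic two-class 'spectrograms': each class has energy in one half of the bins."""
    labels = np.arange(n) % 2
    patterns = np.zeros((2, n_mels))
    patterns[0, : n_mels // 2] = 1.0
    patterns[1, n_mels // 2 :] = 1.0
    noise = 0.1 * rng.normal(size=(n, frames, n_mels))
    return patterns[labels][:, None, :] + noise, labels


def train_linear_probe(emb, labels, n_classes, lr=0.5, steps=300):
    """Softmax regression on fixed embeddings; the encoder weights are never updated."""
    w = np.zeros((emb.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[labels]
    for _ in range(steps):
        logits = emb @ w + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / len(emb)  # gradient of mean cross-entropy w.r.t. logits
        w -= lr * emb.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b


def predict(emb, w, b):
    return (emb @ w + b).argmax(axis=1)


def run_demo(seed=0):
    rng = np.random.default_rng(seed)
    n_mels, emb_dim = 8, 16
    # Fixed weights play the role of the frozen, pre-trained parameters.
    weights = rng.normal(size=(n_mels, emb_dim)) / np.sqrt(n_mels)
    train_clips, train_labels = make_clips(rng, 200, n_mels=n_mels)
    test_clips, test_labels = make_clips(rng, 100, n_mels=n_mels)
    w, b = train_linear_probe(frozen_encoder(train_clips, weights), train_labels, 2)
    preds = predict(frozen_encoder(test_clips, weights), w, b)
    return float((preds == test_labels).mean())


if __name__ == "__main__":
    print("held-out accuracy:", run_demo())
```

In the finetuning setting that the abstract mentions for speech-related tasks, the encoder weights would be updated together with the classifier rather than held fixed as they are here.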
