An empirical study of weakly supervised audio tagging embeddings for general audio representations
Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang

TL;DR
This study evaluates the effectiveness of pre-trained weakly supervised audio tagging models as feature extractors for diverse audio tasks, comparing them with self-supervised methods across fourteen benchmarks.
Contribution
The paper benchmarks weakly supervised audio tagging models on fourteen downstream tasks and shows that they transfer effectively to a range of audio classification tasks.
Findings
AT pre-trained models excel in music, event, and emotion recognition.
Finetuning AT models can also improve performance on speech-related tasks such as keyword spotting and intent classification.
Weakly supervised models are competitive alternatives to self-supervised methods.
Abstract
We study the usability of pre-trained weakly supervised audio tagging (AT) models as feature extractors for general audio representations. We mainly analyze the feasibility of transferring those embeddings to other tasks within the speech and sound domains. Specifically, we benchmark weakly supervised pre-trained models (MobileNetV2 and EfficientNet-B0) against modern self-supervised learning methods (BYOL-A) as feature extractors. Fourteen downstream tasks are used for evaluation, ranging from music instrument classification to language classification. Our results indicate that AT pre-trained models are an excellent transfer learning choice for music, event, and emotion recognition tasks. Further, finetuning AT models can also benefit speech-related tasks such as keyword spotting and intent classification.
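Using a pre-trained model "as a feature extractor" usually means keeping the encoder frozen and training only a small classifier on its embeddings. The sketch below illustrates that idea with a linear probe on toy data. It is an assumption-laden illustration, not the paper's evaluation protocol: `frozen_encoder` is a hypothetical stand-in (a fixed random projection) for an AT model such as MobileNetV2 or EfficientNet-B0, the "clips" are synthetic vectors rather than audio, and all dimensions and hyperparameters are made up for the example.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the toy run is repeatable

IN_DIM, EMB_DIM, N_CLASSES = 8, 16, 2  # illustrative sizes, not from the paper

# Hypothetical stand-in for a pre-trained AT model: a fixed random projection.
# In a real setup this would be the pre-trained encoder's weights.
FROZEN_WEIGHTS = [[rng.gauss(0.0, 1.0) for _ in range(IN_DIM)] for _ in range(EMB_DIM)]


def frozen_encoder(clip):
    """Map an input vector to a ReLU embedding; the weights are never updated."""
    return [max(0.0, sum(w * x for w, x in zip(row, clip))) for row in FROZEN_WEIGHTS]


def make_clip(label):
    """Synthetic 'clip': class 0 is active in the first half, class 1 in the second."""
    half = IN_DIM // 2
    base = ([1.0] * half + [0.0] * half) if label == 0 else ([0.0] * half + [1.0] * half)
    return [v + rng.gauss(0.0, 0.1) for v in base]


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scores(W, b, emb):
    """Class logits of the linear probe for one embedding."""
    return [sum(w * x for w, x in zip(W[c], emb)) + b[c] for c in range(len(W))]


def train_linear_probe(embs, labels, lr=0.01, epochs=100):
    """Softmax regression trained with plain SGD; only W and b are learned."""
    W = [[0.0] * EMB_DIM for _ in range(N_CLASSES)]
    b = [0.0] * N_CLASSES
    for _ in range(epochs):
        for emb, y in zip(embs, labels):
            probs = softmax(scores(W, b, emb))
            for c in range(N_CLASSES):
                grad = probs[c] - (1.0 if c == y else 0.0)  # d(cross-entropy)/d(logit_c)
                for j in range(EMB_DIM):
                    W[c][j] -= lr * grad * emb[j]
                b[c] -= lr * grad
    return W, b


def predict(W, b, emb):
    s = scores(W, b, emb)
    return max(range(len(s)), key=s.__getitem__)


# Embeddings are computed once with the frozen encoder, then reused for training.
train_labels = [i % N_CLASSES for i in range(40)]
test_labels = [i % N_CLASSES for i in range(20)]
train_embs = [frozen_encoder(make_clip(y)) for y in train_labels]
test_embs = [frozen_encoder(make_clip(y)) for y in test_labels]

W, b = train_linear_probe(train_embs, train_labels)
correct = sum(predict(W, b, e) == y for e, y in zip(test_embs, test_labels))
accuracy = correct / len(test_labels)
print(f"linear-probe accuracy on the toy data: {accuracy:.2f}")
```

Finetuning, by contrast, would also update the encoder weights during training on the downstream task, which is the setting the abstract reports as helpful for keyword spotting and intent classification.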
