The Efficacy of Self-Supervised Speech Models for Audio Representations

Tung-Yu Wu; Chen-An Li; Tzu-Han Lin; Tsu-Yuan Hsu; Hung-Yi Lee

arXiv:2209.12900 · cs.SD · February 1, 2023

TL;DR

This paper demonstrates that self-supervised speech models, when combined with ensemble techniques, can effectively produce robust audio representations for both speech and non-speech tasks, surpassing existing models in many cases.

Contribution

The authors propose an ensemble framework for SSL speech models, showing its effectiveness across diverse audio datasets and outperforming state-of-the-art models in the HEAR Challenge.

Findings

01 SSL speech models perform well on non-speech tasks

02 Ensemble techniques improve representation quality

03 Framework surpasses state-of-the-art models in HEAR Challenge

Abstract

Self-supervised learning (SSL) speech models, which can serve as powerful upstream models to extract meaningful speech representations, have achieved unprecedented success in speech representation learning. However, their effectiveness on non-speech datasets is less explored. In this work, we propose an ensemble framework that combines several ensemble techniques to fuse the embeddings of SSL speech models. Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation abilities of our ensemble method and of its individual constituent models. Ablation studies are carried out to evaluate the performance of different ensemble techniques, such as feature averaging and concatenation. All experiments were conducted during the NeurIPS 2021 HEAR Challenge, using the standard evaluation pipeline provided by the competition organizers. Results demonstrate SSL…
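
The two ensemble techniques named in the abstract, feature averaging and concatenation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `average_embeddings` and `concatenate_embeddings` are hypothetical, the toy vectors stand in for real model outputs, and the sketch assumes each model has already produced one fixed-size embedding for the same audio clip.

```python
from typing import List, Sequence

Embedding = List[float]  # one fixed-size vector for an audio clip (assumed format)


def average_embeddings(embeddings: Sequence[Embedding]) -> Embedding:
    """Feature averaging: element-wise mean of vectors that share one dimension."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("averaging requires equal embedding dimensions")
    return [sum(values) / len(embeddings) for values in zip(*embeddings)]


def concatenate_embeddings(embeddings: Sequence[Embedding]) -> Embedding:
    """Concatenation: join vectors end to end; output size is the sum of input sizes."""
    fused: Embedding = []
    for e in embeddings:
        fused.extend(e)
    return fused


# Toy vectors standing in for the outputs of two hypothetical SSL speech models.
model_a = [1.0, 2.0, 3.0]
model_b = [3.0, 4.0, 5.0]

averaged = average_embeddings([model_a, model_b])          # same size as each input
concatenated = concatenate_embeddings([model_a, model_b])  # size 3 + 3 = 6
```

Averaging keeps the embedding size fixed but requires every model to emit vectors of the same dimension, whereas concatenation preserves each model's features separately at the cost of a larger downstream input.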

Code & Models

Repositories

tony10101105/hear-2021-neurips-challenge---ntu-gura
PyTorch · Official

Taxonomy

Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing