The Efficacy of Self-Supervised Speech Models for Audio Representations
Tung-Yu Wu, Chen-An Li, Tzu-Han Lin, Tsu-Yuan Hsu, Hung-Yi Lee

TL;DR
This paper demonstrates that self-supervised speech models, when combined with ensemble techniques, can effectively produce robust audio representations for both speech and non-speech tasks, surpassing existing models in many cases.
Contribution
The authors propose an ensemble framework for SSL speech models, showing its effectiveness across diverse audio datasets and outperforming state-of-the-art models in the HEAR Challenge.
Findings
SSL speech models perform well on non-speech tasks
Ensemble techniques improve representation quality
Framework surpasses state-of-the-art models in HEAR Challenge
Abstract
Self-supervised learning (SSL) speech models, which can serve as powerful upstream models to extract meaningful speech representations, have achieved unprecedented success in speech representation learning. However, their effectiveness on non-speech datasets has been comparatively little explored. In this work, we propose an ensemble framework that combines several ensemble techniques to fuse the embeddings of SSL speech models. Extensive experiments on speech and non-speech audio datasets are conducted to investigate the representation abilities of our ensemble method and of its individual constituent models. Ablation studies are carried out to evaluate the performance of different ensemble techniques, such as feature averaging and concatenation. All experiments were conducted during the NeurIPS 2021 HEAR Challenge, using the standard evaluation pipeline provided by the competition officials. Results demonstrate SSL…
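The two ensemble techniques named in the abstract, feature averaging and concatenation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `concat_fusion` and `average_fusion` are hypothetical, and the sketch assumes each model has already produced a frame-aligned embedding matrix (one row per frame) for the same audio clip.

```python
from typing import List

Matrix = List[List[float]]  # one row per frame, one column per feature


def concat_fusion(embeddings: List[Matrix]) -> Matrix:
    """Concatenate per-model embeddings along the feature axis.

    Assumption: all models emit the same number of frames.
    """
    frame_counts = {len(e) for e in embeddings}
    if len(frame_counts) != 1:
        raise ValueError("all embeddings must have the same number of frames")
    n_frames = frame_counts.pop()
    return [[x for e in embeddings for x in e[t]] for t in range(n_frames)]


def average_fusion(embeddings: List[Matrix]) -> Matrix:
    """Average per-model embeddings element-wise.

    Assumption: all models share both frame count and feature dimension.
    """
    shapes = {(len(e), len(e[0])) for e in embeddings}
    if len(shapes) != 1:
        raise ValueError("all embeddings must have identical shapes")
    n_models = len(embeddings)
    return [
        [sum(values) / n_models for values in zip(*rows)]
        for rows in zip(*embeddings)
    ]


# Toy embeddings standing in for two SSL models' outputs (2 frames, 2 features).
emb_a = [[1.0, 2.0], [3.0, 4.0]]
emb_b = [[3.0, 6.0], [5.0, 8.0]]

fused_cat = concat_fusion([emb_a, emb_b])
fused_avg = average_fusion([emb_a, emb_b])
print(fused_cat)  # [[1.0, 2.0, 3.0, 6.0], [3.0, 4.0, 5.0, 8.0]]
print(fused_avg)  # [[2.0, 4.0], [4.0, 6.0]]
```

Concatenation keeps every model's features at the cost of a larger embedding, while averaging keeps the dimension fixed but only applies when the models' feature sizes match.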
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
