Investigation of Ensemble Features of Self-Supervised Pretrained Models for Automatic Speech Recognition

A Arunkumar; Vrunda N Sukhadia; S. Umesh

arXiv:2206.05518 · cs.CL · February 21, 2023

Open Access

TL;DR

This paper explores combining features from multiple self-supervised pretrained speech models to improve automatic speech recognition, and reports gains over the individual models on the LibriSpeech and WSJ datasets.

Contribution

It introduces an ensemble approach that exploits the complementary features of HuBERT, Wav2vec2.0, and WavLM for ASR, which is a novel application in this context.

Findings

01

Ensemble of models outperforms individual models in ASR tasks.

02

Using combined features improves recognition accuracy on the LibriSpeech and WSJ datasets.

03

Ensemble methods yield richer feature representations for speech recognition.

Abstract

Self-supervised learning (SSL) based models have been shown to generate powerful representations that can be used to improve the performance of downstream speech tasks. Several state-of-the-art SSL models are available, and each of these models optimizes a different loss, which gives rise to the possibility of their features being complementary. This paper proposes using an ensemble of such SSL representations and models, which exploits the complementary nature of the features extracted by the various pretrained models. We hypothesize that this results in a richer feature representation, and we show results for the ASR downstream task. To this end, we use three SSL models that have shown excellent results on ASR tasks, namely HuBERT, Wav2vec2.0, and WavLM. We explore the ensemble of models fine-tuned for the ASR task and the ensemble of features using the embeddings obtained from the…
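The feature-ensemble idea in the abstract can be illustrated with a minimal sketch. It assumes the simplest possible combination, frame-wise concatenation of embeddings; the visible part of the abstract does not say how the authors combine features, so this is not their method. The function name `combine_features`, the 768-dimensional embeddings, and the random arrays standing in for real model outputs are all hypothetical placeholders. The sketch also assumes the models emit frames at the same rate, so that frames can be aligned by index.

```python
import numpy as np


def combine_features(per_model_features):
    """Concatenate frame-level embeddings from several models along the feature axis.

    per_model_features: list of arrays, each shaped (num_frames, feature_dim).
    Frame counts may differ slightly between models, so every array is trimmed
    to the shortest one before concatenation (an assumption of this sketch).
    """
    if not per_model_features:
        raise ValueError("need at least one feature matrix")
    min_frames = min(f.shape[0] for f in per_model_features)
    trimmed = [f[:min_frames] for f in per_model_features]
    return np.concatenate(trimmed, axis=-1)


# Random stand-ins for embeddings from three pretrained models (hypothetical data).
rng = np.random.default_rng(0)
hubert_feats = rng.standard_normal((100, 768))
wav2vec2_feats = rng.standard_normal((99, 768))
wavlm_feats = rng.standard_normal((100, 768))

ensemble = combine_features([hubert_feats, wav2vec2_feats, wavlm_feats])
print(ensemble.shape)  # (99, 2304)
```

Under this assumption, the combined matrix would be fed to the downstream ASR model in place of a single model's features. Each frame then carries the representation learned under all three pretraining losses.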

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing