MS-HuBERT: Mitigating Pre-training and Inference Mismatch in Masked Language Modelling methods for learning Speech Representations

Hemant Yadav; Sunayana Sitaram; Rajiv Ratn Shah

arXiv:2406.05661·cs.CL·February 19, 2025

TL;DR

MS-HuBERT mitigates the pre-training and inference mismatch in HuBERT, yielding stronger speech representations and beating vanilla HuBERT on the Librispeech ASR benchmark by a 5% margin on average.

Contribution

The paper proposes a Swap method and a Multicluster masked prediction loss. The former addresses the mismatch between HuBERT's pre-training and inference, and the latter makes more effective use of the model's capacity, leading to better speech representations.

Findings

01. MS-HuBERT outperforms vanilla HuBERT on the Librispeech ASR benchmark by a 5% margin on average across fine-tuning splits.

02. Learned embeddings encode essential information for ASR.

03. Proposed methods improve model capacity utilization.

Abstract

In recent years, self-supervised pre-training methods have gained significant traction in learning high-level information from raw speech. Among these methods, HuBERT has demonstrated SOTA performance in automatic speech recognition (ASR). However, HuBERT's performance lags behind data2vec due to disparities in pre-training strategies. In this paper, we propose (i) a Swap method to address the pre-training and inference mismatch observed in HuBERT and (ii) a Multicluster masked prediction loss for more effective utilization of the model's capacity. The resulting method, MS-HuBERT, is an end-to-end self-supervised pre-training method for learning robust speech representations. It beats vanilla HuBERT on the Librispeech ASR benchmark by a 5% margin on average when evaluated on different fine-tuning splits. Additionally, we demonstrate that the learned embeddings obtained during…
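
For orientation, the general idea of a masked prediction loss computed against more than one set of cluster labels can be sketched as below. This is a generic illustration only, not the paper's implementation: the abstract does not define the Multicluster loss, so the function name `multi_cluster_masked_loss`, the input layout, and the choice to average the per-set losses equally are all assumptions made for this sketch.

```python
import math


def log_softmax(row):
    # Numerically stable log-softmax over one frame's logits.
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def multi_cluster_masked_loss(logits_per_set, targets_per_set, mask):
    """Cross-entropy on masked frames, averaged over cluster label sets.

    Hypothetical layout (assumed for this sketch):
      logits_per_set[i][t]  -> list of K_i logits for frame t, label set i
      targets_per_set[i][t] -> integer cluster ID for frame t, label set i
      mask[t]               -> True if frame t was masked in the input
    """
    masked = [t for t, is_masked in enumerate(mask) if is_masked]
    if not masked:
        raise ValueError("mask selects no frames")
    set_losses = []
    for logits, targets in zip(logits_per_set, targets_per_set):
        # Only masked frames contribute, as in HuBERT-style masked prediction.
        nll = [-log_softmax(logits[t])[targets[t]] for t in masked]
        set_losses.append(sum(nll) / len(nll))
    # Assumption: each label set is weighted equally.
    return sum(set_losses) / len(set_losses)


# Toy example: 3 frames, two label sets with 4 and 8 clusters.
mask = [True, False, True]
logits_a = [[0.0] * 4 for _ in mask]
logits_b = [[0.0] * 8 for _ in mask]
targets_a = [0, 1, 3]
targets_b = [5, 2, 7]
loss = multi_cluster_masked_loss(
    [logits_a, logits_b], [targets_a, targets_b], mask
)
print(loss)
```

With uniform logits, as in the toy example, the result equals (ln 4 + ln 8) / 2, the average of the chance-level cross-entropies for the two label sets. Unmasked frames have no effect on the value.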

Code & Models

Models

🤗 s3prl/MS-HuBERT (model)

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling