Joint Unsupervised and Supervised Training for Multilingual ASR

Junwen Bai; Bo Li; Yu Zhang; Ankur Bapna; Nikhil Siddhartha; Khe Chai Sim; Tara N. Sainath

arXiv:2111.08137 · cs.CL · November 17, 2021

TL;DR

This paper introduces an end-to-end joint training method for multilingual speech recognition that combines supervised and self-supervised learning. It outperforms existing methods, especially on low-resource languages.

Contribution

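The proposed Joint Unsupervised and Supervised Training (JUST) method unifies supervised and self-supervised training in a single framework, improving multilingual ASR performance over the usual two-stage scheme of self-supervised pretraining followed by supervised finetuning.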

Findings

01 Outperforms state-of-the-art methods on the Multilingual LibriSpeech (MLS) dataset.

02 Significantly reduces word error rate (WER) on low-resource languages.

03 Outperforms monolingual baselines and transfer learning methods.

Abstract

Self-supervised training has shown promising gains in pretraining models and facilitating the downstream finetuning for speech recognition, like multilingual ASR. Most existing methods adopt a 2-stage scheme where the self-supervised loss is optimized in the first pretraining stage, and the standard supervised finetuning resumes in the second stage. In this paper, we propose an end-to-end (E2E) Joint Unsupervised and Supervised Training (JUST) method to combine the supervised RNN-T loss and the self-supervised contrastive and masked language modeling (MLM) losses. We validate its performance on the public dataset Multilingual LibriSpeech (MLS), which includes 8 languages and is extremely imbalanced. On MLS, we explore (1) JUST trained from scratch, and (2) JUST finetuned from a pretrained checkpoint. Experiments show that JUST can consistently outperform other existing state-of-the-art…
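The difference between the 2-stage scheme and joint training described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how the three losses are balanced, so the `LossWeights` class, its default values, and both function names are hypothetical, and the loss values are plain floats standing in for per-batch losses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    # Hypothetical weights: the abstract does not state how the terms are balanced.
    supervised: float = 1.0
    contrastive: float = 1.0
    mlm: float = 1.0


def two_stage_loss(rnnt: float, contrastive: float, mlm: float, stage: str) -> float:
    """2-stage scheme: self-supervised losses first, supervised loss afterwards."""
    if stage == "pretrain":
        return contrastive + mlm
    if stage == "finetune":
        return rnnt
    raise ValueError(f"unknown stage: {stage!r}")


def joint_loss(rnnt: float, contrastive: float, mlm: float,
               weights: LossWeights = LossWeights()) -> float:
    """Joint scheme: all three losses are optimized together at every step."""
    return (weights.supervised * rnnt
            + weights.contrastive * contrastive
            + weights.mlm * mlm)


if __name__ == "__main__":
    # Placeholder per-batch loss values, chosen only for illustration.
    print(two_stage_loss(2.0, 0.5, 0.25, stage="pretrain"))
    print(two_stage_loss(2.0, 0.5, 0.25, stage="finetune"))
    print(joint_loss(2.0, 0.5, 0.25))
```

In the joint form, the supervised signal shapes the encoder from the first step instead of only after pretraining, which is the property the abstract contrasts against the 2-stage scheme.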

Taxonomy

Methods: XLSR