Multiple-hypothesis RNN-T Loss for Unsupervised Fine-tuning and Self-training of Neural Transducer

Cong-Thanh Do, Mohan Li, and Rama Doddipatla

arXiv:2207.14736·cs.CL·August 1, 2022

TL;DR

This paper introduces a multiple-hypothesis RNN-T loss function to improve unsupervised fine-tuning and self-training of speech recognition models, reducing errors caused by ASR hypothesis inaccuracies.

Contribution

It proposes a novel multiple-hypothesis loss for RNN-T models that mitigates ASR errors during unsupervised training and fine-tuning, outperforming single-hypothesis methods.
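One way to picture how several hypotheses could enter a single training objective is sketched below. This is a minimal illustration under stated assumptions, not the paper's formulation: the combination rule (a weighted average of per-hypothesis losses), the function name `multi_hypothesis_loss`, and the `per_hypothesis_loss` callable (a stand-in for an RNN-T loss evaluated against one hypothesis) are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def multi_hypothesis_loss(
    per_hypothesis_loss: Callable[[Sequence[str]], float],
    hypotheses: Sequence[Sequence[str]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Combine the losses of several ASR hypotheses for one utterance.

    ASSUMPTION: a weighted average is used purely for illustration; the
    paper may combine hypotheses differently.
    `per_hypothesis_loss` is a hypothetical stand-in for an RNN-T loss
    computed with a single hypothesis as the target.
    """
    if not hypotheses:
        raise ValueError("at least one hypothesis is required")
    if weights is None:
        # Default to treating every hypothesis as equally reliable.
        weights = [1.0] * len(hypotheses)
    if len(weights) != len(hypotheses):
        raise ValueError("weights and hypotheses must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(w * per_hypothesis_loss(h) for w, h in zip(weights, hypotheses))
    return weighted / total_weight


def toy_loss(hypothesis: Sequence[str]) -> float:
    # Toy stand-in: pretend the loss equals the hypothesis length.
    return float(len(hypothesis))


if __name__ == "__main__":
    hyps = [["the", "cat"], ["the", "cat", "sat", "down"]]
    print(multi_hypothesis_loss(toy_loss, hyps))
```

With a single hypothesis the function reduces to the ordinary single-hypothesis loss, which is the baseline the contribution is compared against.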

Findings

01. 14.2% relative WER reduction over the single-hypothesis approach on the Librispeech test_other set (fine-tuning)

02. 3.3% relative WER reduction on CHiME-4 noisy data

03. Effective in both fine-tuning and self-training scenarios
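The figures above are relative reductions, meaning the WER drop is expressed as a percentage of the baseline WER rather than in absolute points. A minimal sketch of that arithmetic follows; the function name and the example WER values are illustrative assumptions and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Hypothetical values for illustration only (not results from the paper):
    # a baseline WER of 20.0% improved to 18.0%.
    print(relative_wer_reduction(20.0, 18.0))
```

A relative reduction therefore says nothing about the absolute WER on its own; the same percentage corresponds to a larger absolute gain when the baseline WER is higher.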

Abstract

This paper proposes a new approach to unsupervised fine-tuning and self-training using unlabeled speech data for recurrent neural network (RNN)-Transducer (RNN-T) end-to-end (E2E) automatic speech recognition (ASR) systems. Conventional systems perform fine-tuning/self-training on unlabeled audio data using ASR hypotheses as the targets, and are therefore susceptible to the ASR performance of the base model. To alleviate the influence of ASR errors when using unlabeled data, we propose a multiple-hypothesis RNN-T loss that incorporates multiple ASR 1-best hypotheses into the loss function. For the fine-tuning task, ASR experiments on Librispeech show that the multiple-hypothesis approach achieves a 14.2% relative reduction in word error rate (WER) compared to the single-hypothesis approach on the test_other set. For the self-training task, ASR models are…

Taxonomy

Topics: Speech Recognition and Synthesis · Neural Networks and Applications · Speech and Audio Processing

Methods: Balanced Selection