Using Synthetic Audio to Improve The Recognition of Out-Of-Vocabulary Words in End-To-End ASR Systems

Xianrui Zheng; Yulan Liu; Deniz Gunceler; Daniel Willett (Amazon; Alexa)

arXiv:2011.11564 · eess.AS · February 11, 2021 · 1 cites

Open Access

TL;DR

This paper uses synthetic audio from a text-to-speech (TTS) engine to improve recognition of out-of-vocabulary (OOV) words in end-to-end ASR systems, reporting a 57% relative WER reduction on OOV words with no degradation on the overall test set.

Contribution

The study demonstrates that fine-tuning RNN-T models with synthetic OOV audio and elastic weight consolidation enhances OOV word recognition in ASR systems.
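
To make the regulariser named above concrete, here is a minimal sketch of the standard elastic weight consolidation (EWC) penalty in its usual textbook form. It is not the paper's implementation: this summary does not say which layers are regularised, how the Fisher information is estimated, or what weighting is used, so `fisher_diag`, `strength`, and both function names are assumptions made for illustration.

```python
import numpy as np


def ewc_penalty(params, ref_params, fisher_diag, strength):
    """Standard EWC penalty: 0.5 * strength * sum_i F_i * (theta_i - theta_ref_i)^2.

    params      -- current weights during fine-tuning (dict of arrays)
    ref_params  -- weights of the model before fine-tuning (same keys)
    fisher_diag -- diagonal Fisher information estimate per weight (assumed given)
    strength    -- scalar regularisation weight (hypothetical hyperparameter)
    """
    total = 0.0
    for name, theta in params.items():
        diff = theta - ref_params[name]
        # Weights with large Fisher values are pulled back harder.
        total += float(np.sum(fisher_diag[name] * diff ** 2))
    return 0.5 * strength * total


def fine_tune_loss(task_loss, params, ref_params, fisher_diag, strength=1.0):
    """Task loss on the fine-tuning data plus the EWC penalty."""
    return task_loss + ewc_penalty(params, ref_params, fisher_diag, strength)


if __name__ == "__main__":
    # Toy values only, chosen to make the arithmetic easy to follow.
    params = {"w": np.array([1.0, 2.0])}
    ref_params = {"w": np.array([0.0, 2.0])}
    fisher_diag = {"w": np.array([2.0, 5.0])}
    print(ewc_penalty(params, ref_params, fisher_diag, strength=3.0))
```

The penalty grows only for weights that both moved away from their pre-fine-tuning values and carry high Fisher information, which is how EWC limits forgetting while still letting the model adapt to new data.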

Findings

01 57% relative WER reduction on OOV words

02 No degradation on overall test set performance

03 Effective use of synthetic data for OOV recognition improvement
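
The first finding is a relative figure, not an absolute one. The sketch below shows how a relative WER reduction is conventionally computed; the function name and the numbers in the usage example are hypothetical and are not taken from the paper, which reports only the 57% figure in this summary.

```python
def relative_wer_reduction(baseline_wer, new_wer):
    """Relative WER reduction: (baseline - new) / baseline.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    # Made-up illustrative values, NOT results from the paper:
    # a drop from 20.0% to 8.6% WER is a reduction of about 57% relative,
    # but only 11.4 percentage points absolute.
    print(round(relative_wer_reduction(20.0, 8.6), 2))
```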

Abstract

Today, many state-of-the-art automatic speech recognition (ASR) systems apply all-neural models that map audio to word sequences trained end-to-end along one global optimisation criterion in a fully data driven fashion. These models allow high precision ASR for domains and words represented in the training material but have difficulties recognising words that are rarely or not at all represented during training, i.e. trending words and new named entities. In this paper, we use a text-to-speech (TTS) engine to provide synthetic audio for out-of-vocabulary (OOV) words. We aim to boost the recognition accuracy of a recurrent neural network transducer (RNN-T) on OOV words by using the extra audio-text pairs, while maintaining the performance on the non-OOV words. Different regularisation techniques are explored and the best performance is achieved by fine-tuning the RNN-T on both original…

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Music and Audio Processing