Language Model Bootstrapping Using Neural Machine Translation For Conversational Speech Recognition
Surabhi Punjabi, Harish Arsikere, Sri Garimella

TL;DR
This paper proposes using neural machine translation as a data augmentation technique to bootstrap conversational speech recognition models in new languages, demonstrating significant WER reduction, especially in underrepresented interaction scenarios.
Contribution
It introduces domain adaptation techniques for effective MT-based data augmentation in speech recognition, addressing challenges like domain mismatch and named entities.
Findings
Achieved 7.8-15.6% relative WER reduction using MT-based augmentation.
Translation particularly improves underrepresented interaction scenarios.
Domain adaptation techniques enhance the effectiveness of neural machine translation for speech data.
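For readers unfamiliar with the metric, relative WER reduction is conventionally computed as the drop in WER divided by the baseline WER. The sketch below illustrates that arithmetic only. The function name and the WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a percentage of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration; these are NOT results from the paper.
baseline = 20.0   # WER (%) of a baseline language model
augmented = 18.0  # WER (%) after adding translated data

print(f"{relative_wer_reduction(baseline, augmented):.1f}% relative reduction")
```

A relative figure depends on the baseline: the same absolute drop of 2 points would be a larger relative reduction against a lower baseline WER.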
Abstract
Building conversational speech recognition systems for new languages is constrained by the availability of utterances that capture user-device interactions. Data collection is both expensive and limited by the speed of manual transcription. To address this, we advocate the use of neural machine translation as a data augmentation technique for bootstrapping language models. Machine translation (MT) offers a systematic way of incorporating collections from mature, resource-rich conversational systems that may be available for a different language. However, ingesting raw translations from a general-purpose MT system may not be effective owing to the presence of named entities, intra-sentential code-switching, and the domain mismatch between the conversational data being translated and the parallel text used for MT training. To circumvent this, we explore the following domain…
