PM-MMUT: Boosted Phone-Mask Data Augmentation using Multi-Modeling Unit Training for Phonetic-Reduction-Robust E2E Speech Recognition

Guodong Ma; Pengfei Hu; Nurmemet Yolwas; Shen Huang; Hao Huang

arXiv:2112.06721 · cs.SD · July 5, 2022

TL;DR

This paper introduces PM-MMUT, which fuses multi-modeling unit training (MMUT) with Phone Masking Training (PMT) to make end-to-end speech recognition more robust to consonant and vowel reduction.

Contribution

It proposes a multi-modeling unit training framework fused with phone masking training, enhancing phoneme-level context learning for more accurate ASR.

Findings

01 Outperforms pure PMT in Uyghur ASR tasks.

02 Achieves about 10% relative WER reduction on Librispeech without LM fusion.

03 Shows gains on both Uyghur and English (Librispeech) data.
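For readers unfamiliar with the metric in the second finding, a relative WER reduction is the drop in word error rate expressed as a fraction of the baseline WER. The short sketch below shows the arithmetic; the function name and the numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the WER drop as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical WER values (in percent), chosen only to illustrate the formula.
example = relative_wer_reduction(baseline_wer=5.0, new_wer=4.5)
print(f"{example:.1%}")
```

Note that a 10% relative reduction is not the same as a 10-point absolute reduction: in the made-up example the absolute WER changes by only 0.5 points.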

Abstract

Consonant and vowel reduction are often encountered in speech and can degrade the performance of automatic speech recognition (ASR). Our recently proposed masking-based learning strategy, Phone Masking Training (PMT), alleviates the impact of such phenomena in Uyghur ASR. Although PMT achieves remarkable improvements, there is still room for further gains because of the granularity mismatch between the masking unit of PMT (phoneme) and the modeling unit (word-piece). To boost the performance of PMT, we propose a multi-modeling unit training (MMUT) architecture fused with PMT (PM-MMUT). The idea of the MMUT framework is to split the Encoder into two parts: acoustic feature sequences to phoneme-level representation (AF-to-PLR) and phoneme-level representation to word-piece-level representation (PLR-to-WPLR). It allows AF-to-PLR to be optimized by an intermediate…
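The two ideas in the abstract can be illustrated with a rough, dependency-free sketch. It is not the authors' implementation. It assumes that a frame-level phone alignment is available, that masked frames are set to zero, and that each phone is masked independently with a fixed probability. The abstract is cut off before it names the intermediate objective, so the sketch only assumes that some phoneme-level loss is applied to the output of the first encoder stage and added to the word-piece-level loss with a hypothetical weight. All function and parameter names are illustrative.

```python
import random
from typing import Callable, List, Sequence, Tuple

Frame = List[float]


def phone_mask(
    frames: Sequence[Frame],
    alignment: Sequence[Tuple[int, int]],
    mask_prob: float,
    rng: random.Random,
) -> List[Frame]:
    """Zero out all frames of randomly selected phone segments.

    alignment holds one (start, end) frame range per phone, end exclusive.
    Zero-filling and independent per-phone sampling are assumptions.
    """
    masked = [list(frame) for frame in frames]  # copy; input is left untouched
    for start, end in alignment:
        if rng.random() < mask_prob:
            for t in range(start, end):
                masked[t] = [0.0] * len(masked[t])
    return masked


def mmut_loss(
    frames: Sequence[Frame],
    af_to_plr: Callable,       # first encoder stage: features -> phoneme level
    plr_to_wplr: Callable,     # second encoder stage: phoneme -> word-piece level
    phoneme_loss: Callable,    # intermediate loss on the first stage's output
    wordpiece_loss: Callable,  # loss on the final word-piece-level output
    phoneme_weight: float,     # hypothetical interpolation weight
) -> float:
    """Combine a word-piece loss with an intermediate phoneme-level loss."""
    plr = af_to_plr(frames)
    wplr = plr_to_wplr(plr)
    return wordpiece_loss(wplr) + phoneme_weight * phoneme_loss(plr)


# Toy usage: 6 frames, 3 phones, constant stand-in losses.
frames = [[1.0, 2.0] for _ in range(6)]
alignment = [(0, 2), (2, 5), (5, 6)]
masked = phone_mask(frames, alignment, mask_prob=0.5, rng=random.Random(0))
total = mmut_loss(
    masked,
    af_to_plr=lambda x: x,
    plr_to_wplr=lambda x: x,
    phoneme_loss=lambda plr: 2.0,
    wordpiece_loss=lambda wplr: 1.0,
    phoneme_weight=0.5,
)
```

Masking whole phone segments, rather than arbitrary time spans, is what ties the augmentation to the phoneme unit; giving the first encoder stage its own phoneme-level objective is one way to address the granularity mismatch the abstract describes.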

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Connectionist Temporal Classification Loss