Leveraging Phone Mask Training for Phonetic-Reduction-Robust E2E Uyghur Speech Recognition
Guodong Ma, Pengfei Hu, Jian Kang, Shen Huang, Hao Huang

TL;DR
This paper introduces a phone mask training technique for Uyghur speech recognition. Randomly masking phone-level features during model training simulates phonetic reduction, which is especially common in spontaneous speech, and improves recognition accuracy.
Contribution
The study proposes a novel phone mask training method for Conformer-based Uyghur E2E speech recognition, enhancing robustness against phonetic reduction effects in spontaneous speech.
Findings
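Achieved a 5.51% relative word error rate (WER) reduction on reading speech, compared with the Conformer-based E2E baseline without mask training.
Achieved a 12.92% relative WER reduction on spontaneous speech against the same baseline.
Demonstrated a 20% relative improvement on the open-source THUYG-20 dataset.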
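The figures above are relative, not absolute, reductions. As a reminder of how such a number is usually computed, here is a minimal sketch; the function name and the WER values in the usage line are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are absolute WERs in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical values for illustration only (not results from the paper).
reduction = relative_wer_reduction(baseline_wer=20.0, new_wer=18.0)
print(f"{reduction:.2f}% relative WER reduction")
```

A relative reduction therefore depends on the baseline: the same absolute gain is a larger relative gain when the baseline WER is low.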
Abstract
In Uyghur speech, consonant and vowel reduction occur frequently, especially in spontaneous speech with a high speech rate, and degrade speech recognition performance. To address this problem, we propose an effective phone mask training method for Conformer-based Uyghur end-to-end (E2E) speech recognition. The idea is to randomly mask a certain percentage of phone features during model training, which simulates the above phenomena and helps the E2E model learn more contextual information. Experiments show that the above issues can be greatly alleviated. In addition, we investigate different masking units in depth, which shows the effectiveness of the proposed masking unit. We also further study the masking method and optimize the filling strategy of the phone mask. Finally, compared with the Conformer-based E2E baseline without mask…
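The masking idea in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes phone boundaries are available as frame ranges (for example from a forced alignment), that masking is applied to whole phone segments, and that the fill is either zeros or the utterance-level feature mean. The function name `phone_mask`, the `mask_ratio` default, and both fill options are hypothetical.

```python
import numpy as np


def phone_mask(features, segments, mask_ratio=0.15, fill="mean", rng=None):
    """Mask the frames of randomly chosen phone segments.

    features   : (T, D) array of acoustic features.
    segments   : list of (start_frame, end_frame) pairs, end exclusive,
                 one per phone (assumed to come from a forced alignment).
    mask_ratio : fraction of phone segments to mask (hypothetical default).
    fill       : "zero" or "mean" (utterance-level mean); illustrative
                 choices, not necessarily the paper's filling strategy.
    Returns the masked copy and the sorted indices of masked segments.
    """
    rng = np.random.default_rng() if rng is None else rng
    masked = features.copy()  # leave the input untouched

    n_mask = int(round(mask_ratio * len(segments)))
    if n_mask == 0:
        return masked, []

    if fill == "zero":
        fill_value = np.zeros(features.shape[1], dtype=features.dtype)
    elif fill == "mean":
        fill_value = features.mean(axis=0)
    else:
        raise ValueError(f"unknown fill strategy: {fill!r}")

    # Sample distinct phone segments and overwrite all of their frames.
    chosen = rng.choice(len(segments), size=n_mask, replace=False)
    for idx in chosen:
        start, end = segments[idx]
        masked[start:end] = fill_value
    return masked, sorted(int(i) for i in chosen)


# Toy usage: 10 frames of 4-dimensional features, 4 phone segments.
feats = np.arange(40, dtype=float).reshape(10, 4)
phones = [(0, 3), (3, 5), (5, 8), (8, 10)]
masked_feats, masked_ids = phone_mask(
    feats, phones, mask_ratio=0.5, fill="zero", rng=np.random.default_rng(0)
)
print(masked_ids)
```

Masking whole phone segments, rather than fixed-width time spans, is what ties the augmentation to phonetic reduction: the model has to infer a missing phone from its context, much as it must when a consonant or vowel is reduced in fast speech.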
