Automatic Pronunciation Generation by Utilizing a Semi-supervised Deep Neural Networks
Naoya Takahashi, Tofigh Naghibi, Beat Pfister

TL;DR
This paper introduces a semi-supervised deep neural network approach for automatic pronunciation generation that improves speech recognition accuracy by jointly estimating sub-word units and dictionaries from orthographic transcriptions.
Contribution
It presents a novel data-driven method that reduces reliance on handcrafted pronunciation dictionaries and handles pronunciation variations effectively.
Findings
Outperforms phoneme-based continuous speech recognition on the TIMIT dataset
Reduces effort in dictionary creation and error correction
Targets under-resourced dialects and languages, where handcrafted dictionaries are hard to obtain
Abstract
Phonemic or phonetic sub-word units are the most commonly used atomic elements to represent speech signals in modern ASR systems. However, they are not the optimal choice for several reasons: the large amount of effort required to handcraft a pronunciation dictionary, pronunciation variations, human mistakes, and under-resourced dialects and languages. Here, we propose a data-driven pronunciation estimation and acoustic modeling method that takes only the orthographic transcription to jointly estimate a set of sub-word units and a reliable dictionary. Experimental results show that the proposed method, which is based on semi-supervised training of a deep neural network, largely outperforms phoneme-based continuous speech recognition on the TIMIT dataset.
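Illustration
To make the idea of a dictionary over data-driven sub-word units more concrete, here is a minimal sketch in Python. It is not the authors' implementation: the unit IDs, the toy lexicon, and the helper name expand_transcription are all hypothetical, chosen only to show how an orthographic transcription could be expanded into candidate unit sequences when a word has several pronunciation variants.

```python
from itertools import product

# Hypothetical lexicon: each word maps to one or more pronunciation variants.
# Each variant is a sequence of integer IDs for data-driven sub-word units
# (assumed here to replace handcrafted phoneme symbols).
Lexicon = dict[str, list[tuple[int, ...]]]


def expand_transcription(words: list[str], lexicon: Lexicon) -> list[tuple[int, ...]]:
    """Return every unit-ID sequence the lexicon allows for a word sequence."""
    variants_per_word = [lexicon[w] for w in words]
    return [
        # Concatenate the chosen variant of each word into one flat sequence.
        tuple(unit for variant in choice for unit in variant)
        for choice in product(*variants_per_word)
    ]


# Toy values for illustration only; they are not taken from the paper.
toy_lexicon: Lexicon = {
    "the": [(3, 7), (3, 9)],  # two pronunciation variants
    "cat": [(12, 4, 5)],      # a single variant
}

sequences = expand_transcription(["the", "cat"], toy_lexicon)
```

In a setting like the one the abstract describes, both the unit inventory and the lexicon entries would be estimated from data rather than written by hand; the sketch only shows the data structure such an estimate might populate.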
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
