Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning
Siqi Sun, Korin Richmond

TL;DR
This paper introduces a multi-task learning (MTL) approach for acquiring pronunciation knowledge from transcribed speech audio. It simplifies the implementation flow while keeping pronunciation prediction accuracy comparable to the previous ASR-based method.
Contribution
It presents a novel MTL-based method that leverages transcribed speech audio for pronunciation learning, reducing complexity compared to previous auxiliary ASR-based approaches.
Findings
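Phoneme error rate (PER) falls from 2.5% to 1.6%, relative to a baseline Seq2Seq frontend, for word types covered exclusively in transcribed speech audio
Performance is similar to the previous auxiliary ASR-based method
The implementation flow is simpler than in the previous method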
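For readers unfamiliar with the metric, PER is conventionally computed as the phoneme-level edit distance between predicted and reference pronunciations, divided by the total number of reference phonemes. The sketch below illustrates that standard definition only. The function names and the toy phoneme sequences are hypothetical, and the paper's exact scoring setup is not given in this summary.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions) between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion from ref
                            curr[j - 1] + 1,      # insertion into ref
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phoneme_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit distance divided by total reference length, over paired sequences."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same number of entries")
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference set contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Toy example (hypothetical data): one substitution in six reference phonemes.
refs = [["k", "ae", "t"], ["d", "ao", "g"]]
hyps = [["k", "ae", "t"], ["d", "ao", "k"]]
per = phoneme_error_rate(refs, hyps)
```

Under this definition, a PER of 1.6% corresponds to roughly 16 phoneme errors per 1,000 reference phonemes.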
Abstract
Recent work has shown the feasibility and benefit of bootstrapping an integrated sequence-to-sequence (Seq2Seq) linguistic frontend from a traditional pipeline-based frontend for text-to-speech (TTS). To overcome the fixed lexical coverage of bootstrapping training data, previous work has proposed to leverage easily accessible transcribed speech audio as an additional training source for acquiring novel pronunciation knowledge for uncovered words, which relies on an auxiliary ASR model as part of a cumbersome implementation flow. In this work, we propose an alternative method to leverage transcribed speech audio as an additional training source, based on multi-task learning (MTL). Experiments show that, compared to a baseline Seq2Seq frontend, the proposed MTL-based method reduces PER from 2.5% to 1.6% for those word types covered exclusively in transcribed speech audio, achieving a…
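Multi-task learning is commonly implemented by optimising a weighted sum of per-task losses. The sketch below shows only that generic pattern. The names `frontend_loss`, `audio_task_loss`, and `audio_weight`, and the default weight value, are assumptions for illustration. The abstract does not state how the paper combines its training objectives.

```python
def mtl_loss(frontend_loss: float, audio_task_loss: float, audio_weight: float = 0.5) -> float:
    """Generic weighted-sum MTL objective (an assumption, not the paper's stated formulation).

    frontend_loss:   loss of the primary text-to-pronunciation task (hypothetical name)
    audio_task_loss: loss of the auxiliary task trained on transcribed speech audio (hypothetical name)
    audio_weight:    non-negative weight on the auxiliary loss (hypothetical hyperparameter)
    """
    if audio_weight < 0:
        raise ValueError("audio_weight must be non-negative")
    return frontend_loss + audio_weight * audio_task_loss


# With the auxiliary weight set to zero, the objective reduces to the primary loss alone.
combined = mtl_loss(1.0, 2.0, audio_weight=0.5)
primary_only = mtl_loss(1.0, 2.0, audio_weight=0.0)
```

In practice the weight would be tuned on held-out data, since too large a value can let the auxiliary task dominate the shared parameters.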
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Tanh Activation · Sigmoid Activation · Long Short-Term Memory · Sequence to Sequence
