Multilingual context-based pronunciation learning for Text-to-Speech
Giulia Comini, Manuel Sam Ribeiro, Fan Yang, Heereen Shim, Jaime Lorenzo-Trueba

TL;DR
This paper presents a unified multilingual front-end for TTS that handles the pronunciation tasks traditionally split across separate modules, and reports competitive performance across languages and tasks.
Contribution
The work introduces a single multilingual model that replaces multiple language-specific modules for pronunciation tasks in TTS systems.
Findings
Competitive performance across multiple languages and tasks
Effective handling of G2P conversion and disambiguation
Some trade-offs compared to monolingual solutions
Abstract
Phonetic information and linguistic knowledge are essential components of a Text-to-Speech (TTS) front-end. Given a language, a lexicon can be collected offline, and Grapheme-to-Phoneme (G2P) relationships are usually modeled in order to predict the pronunciation of out-of-vocabulary (OOV) words. Additionally, post-lexical phonology, often defined in the form of rule-based systems, is used to correct pronunciation within or between words. In this work we showcase a multilingual unified front-end system that addresses any pronunciation-related task typically handled by separate modules. We evaluate the proposed model on G2P conversion and other language-specific challenges, such as homograph and polyphone disambiguation, post-lexical rules, and implicit diacritization. We find that the multilingual model is competitive across languages and tasks; however, some trade-offs exist when…
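For orientation, the traditional modular front-end that the abstract contrasts with the unified model can be sketched as a lexicon lookup, a G2P fallback for OOV words, and a post-lexical rule pass. This is a minimal illustration only, not the paper's system: the lexicon entries, the letter-by-letter `fallback_g2p` stand-in, the phone symbols, and the single "the before a vowel" rule are all assumptions made for the example.

```python
from typing import Callable, Dict, List

# Toy lexicon with hypothetical entries (not taken from the paper).
LEXICON: Dict[str, List[str]] = {
    "the": ["DH", "AH"],
    "apple": ["AE", "P", "AH", "L"],
    "end": ["EH", "N", "D"],
}

# Phone symbols treated as vowels by the toy post-lexical rule below.
VOWELS = {"AE", "AH", "EH", "IY"}


def fallback_g2p(word: str) -> List[str]:
    """Stand-in for a trained G2P model: one uppercase symbol per letter."""
    return [ch.upper() for ch in word if ch.isalpha()]


def lookup(word: str, g2p: Callable[[str], List[str]] = fallback_g2p) -> List[str]:
    """Use the lexicon when the word is known; fall back to G2P for OOV words."""
    key = word.lower()
    if key in LEXICON:
        return list(LEXICON[key])
    return g2p(key)


def post_lexical(words: List[List[str]]) -> List[List[str]]:
    """Toy cross-word rule: 'the' becomes DH IY before a vowel-initial word."""
    out = [list(w) for w in words]
    for i in range(len(out) - 1):
        nxt = out[i + 1]
        if out[i] == ["DH", "AH"] and nxt and nxt[0] in VOWELS:
            out[i] = ["DH", "IY"]
    return out


def front_end(text: str) -> List[List[str]]:
    """Run the three separate stages in sequence over whitespace-split words."""
    return post_lexical([lookup(w) for w in text.split()])
```

Calling `front_end("the apple")` exercises the lexicon and the cross-word rule, while `front_end("the zug")` sends the unknown word through the G2P fallback. Each stage here is a separate, language-specific component, which is the kind of pipeline the abstract describes a single multilingual model as replacing.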
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
