DNN-based Speech Synthesis for Indian Languages from ASCII text
Srikanth Ronanki, Siva Reddy, Bajibabu Bollepalli, Simon King

TL;DR
This paper explores deep neural network-based methods to synthesize speech from noisy ASCII transliterations of Indian languages, demonstrating competitive quality across Hindi, Tamil, and Telugu, and releasing datasets publicly.
Contribution
It evaluates three approaches for converting ASCII transliterations to speech using DNNs, addressing the challenge of noisy, non-standard text input.
Findings
Models produce speech comparable to native script-based synthesis.
Supervised G2P approach outperforms naive methods.
Datasets are publicly released for further research.
Abstract
Text-to-Speech synthesis in Indian languages has seen a lot of progress over the past decade, partly due to the annual Blizzard Challenges. These systems assume the text is written in Devanagari or Dravidian scripts, which are nearly phonemic orthographies. However, the most common form of computer interaction among Indians is transliterated text written in ASCII. Such text is generally noisy, with many spelling variations for the same word. In this paper we evaluate three approaches to synthesizing speech from such noisy ASCII text: a naive Uni-Grapheme approach, a Multi-Grapheme approach, and a supervised Grapheme-to-Phoneme (G2P) approach. These methods first convert the ASCII text to a phonetic script, and then train a Deep Neural Network to synthesize speech from it. We train and test our models on Blizzard Challenge datasets that were transliterated to ASCII using crowdsourcing.…
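The difference between the first two approaches can be illustrated with a small sketch. This is not the authors' implementation: it assumes that "Uni-Grapheme" means treating each ASCII letter as one unit and that "Multi-Grapheme" means grouping common letter sequences into single units. The unit list MULTI_UNITS and both function names are hypothetical, chosen only for illustration.

```python
# Hypothetical inventory of multi-letter units; the paper's actual
# inventory is not given in this summary.
MULTI_UNITS = ("aa", "ee", "oo", "kh", "gh", "ch", "th", "dh", "bh", "sh")


def uni_graphemes(word):
    """Treat every ASCII letter as its own unit (assumed naive baseline)."""
    return [c for c in word.lower() if c.isalpha()]


def multi_graphemes(word, units=MULTI_UNITS):
    """Greedy longest-match segmentation into multi-letter units."""
    word = word.lower()
    max_len = max(len(u) for u in units)
    out = []
    i = 0
    while i < len(word):
        if not word[i].isalpha():
            i += 1  # skip digits, punctuation, whitespace
            continue
        # Try the longest candidate first, down to length 2.
        for n in range(max_len, 1, -1):
            chunk = word[i:i + n]
            if chunk in units:
                out.append(chunk)
                i += n
                break
        else:
            # No multi-letter unit matched: fall back to a single letter.
            out.append(word[i])
            i += 1
    return out


# Two plausible ASCII spellings of the same word yield different sequences,
# which is the kind of noise the abstract describes.
print(uni_graphemes("bhaarat"))
print(multi_graphemes("bhaarat"))
print(multi_graphemes("bharat"))
```

Under these assumptions, the multi-letter segmentation keeps sequences such as "bh" and "aa" together, so the downstream model sees fewer, more phone-like units. The supervised G2P approach in the abstract would replace this hand-written rule with a learned mapping, which the sketch does not attempt to show.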
