Towards Prosodically Informed Mizo TTS without Explicit Tone Markings
Abhijit Mohanta, Remruatpuii, Priyankoo Sarmah, Rohit Sinha, Wendy Lalhminghlui

TL;DR
This paper presents a low-resource Mizo TTS system that leverages prosodic information without explicit tone markings, demonstrating acceptable quality and tone accuracy with minimal data.
Contribution
It introduces a prosodically informed, end-to-end TTS approach for Mizo that does not rely on explicit tone annotations, achieving competitive quality with limited data.
Findings
VITS outperforms Tacotron2 in tone accuracy and overall quality
The system achieves perceptually acceptable speech with only 5.18 hours of data
A non-autoregressive, end-to-end model (VITS) can synthesize Mizo, a tonal language, with acceptable perceptual quality and intelligibility
Abstract
This paper reports on the development of a text-to-speech (TTS) system for Mizo, a low-resource, tonal Tibeto-Burman language spoken primarily in the Indian state of Mizoram. The TTS system was built with only 5.18 hours of data; nevertheless, subjective and objective evaluations found its outputs perceptually acceptable and intelligible. A baseline model was built using Tacotron2, and a second TTS model was then built with VITS on the same data. In both subjective and objective evaluations, the VITS model outperformed the Tacotron2 model. In terms of tone synthesis, the VITS model showed significantly lower tone errors than the Tacotron2 model. The paper demonstrates that a non-autoregressive, end-to-end framework can achieve synthesis of acceptable perceptual quality and intelligibility.
Taxonomy
Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing
