Towards Prosodically Informed Mizo TTS without Explicit Tone Markings

Abhijit Mohanta; Remruatpuii; Priyankoo Sarmah; Rohit Sinha; Wendy Lalhminghlui

arXiv:2601.02073·eess.AS·January 6, 2026

Open Access

TL;DR

This paper presents a TTS system for Mizo, a low-resource tonal language, trained on only 5.18 hours of data without explicit tone markings. The synthesized speech is perceptually acceptable and intelligible, and the VITS model makes fewer tone errors than the Tacotron2 baseline.

Contribution

It introduces a prosodically informed, end-to-end TTS approach for Mizo that does not rely on explicit tone annotations and reaches perceptually acceptable quality with limited data.

Findings

01 VITS outperforms Tacotron2 in tone accuracy and overall quality

02 The system achieves perceptually acceptable speech with only 5.18 hours of data

03 A non-autoregressive, end-to-end model (VITS) can synthesize a tonal language with acceptable perceptual quality and intelligibility

Abstract

This paper reports on the development of a text-to-speech (TTS) system for Mizo, a low-resource, tonal Tibeto-Burman language spoken primarily in the Indian state of Mizoram. The TTS was built with only 5.18 hours of data; nevertheless, in both subjective and objective evaluations, the outputs were judged perceptually acceptable and intelligible. A baseline model was built with Tacotron2, and a second TTS model was then built with VITS on the same data. In both subjective and objective evaluations, the VITS model outperformed the Tacotron2 model. In terms of tone synthesis, the VITS model showed significantly lower tone errors than the Tacotron2 model. The paper demonstrates that a non-autoregressive, end-to-end framework can achieve synthesis of acceptable perceptual quality and intelligibility.
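The abstract compares the two models by their tone errors but does not say how those errors are counted. As a minimal sketch only, assuming a per-syllable comparison between the intended tone and the tone perceived in the synthesized audio, a tone error rate could be computed as below. The function name `tone_error_rate`, the single-letter tone labels, and the example sequences are all hypothetical illustrations, not the paper's metric or data.

```python
from typing import Sequence


def tone_error_rate(reference: Sequence[str], perceived: Sequence[str]) -> float:
    """Fraction of syllables whose perceived tone differs from the reference tone.

    Assumption: both sequences are already aligned syllable by syllable.
    """
    if len(reference) != len(perceived):
        raise ValueError("tone sequences must have the same number of syllables")
    if not reference:
        raise ValueError("tone sequences must not be empty")
    errors = sum(ref != hyp for ref, hyp in zip(reference, perceived))
    return errors / len(reference)


# Hypothetical tone labels for one five-syllable utterance (not data from the paper).
reference_tones = ["H", "L", "R", "F", "H"]
system_a_tones = ["H", "L", "R", "L", "H"]  # one syllable differs
system_b_tones = ["H", "H", "R", "L", "L"]  # three syllables differ

rate_a = tone_error_rate(reference_tones, system_a_tones)
rate_b = tone_error_rate(reference_tones, system_b_tones)
print(f"system A: {rate_a:.2f}, system B: {rate_b:.2f}")
```

Under this assumed definition, a lower value means more syllables carry the intended tone, which is the direction in which the abstract reports VITS ahead of Tacotron2.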

Taxonomy

Topics: Speech Recognition and Synthesis · Phonetics and Phonology Research · Speech and Audio Processing