Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
Jiaxu He, Chao Wang, Jie Lian, Yuqing Cai, Yongxiang Li, Renzeg Duojie, Jie Li

TL;DR
This paper introduces a large-model Tibetan TTS system that effectively addresses low-resource challenges, dialectal variation, and complex text-pronunciation mapping, achieving high-quality speech synthesis.
Contribution
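It presents what the authors describe as, to the best of their knowledge, the first large-model-based Tibetan TTS system, combining data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource settings.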
Findings
Achieved MOS scores of 4.28 and 4.35 for the syllable-level and BPE-based systems, respectively.
Reached pronunciation accuracies of 97.6% and 96.6% for the same two systems, respectively, outperforming commercial TTS.
Demonstrated the effectiveness of large-model adaptation for Tibetan speech synthesis.
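For readers unfamiliar with the two metrics above, the following is a minimal sketch of how a mean opinion score (MOS) and a pronunciation accuracy are commonly computed. The function names, the ratings, and the judgments are hypothetical and are not taken from the paper; the summary does not state the unit (syllable, word, or sentence) at which the authors measure accuracy, so a per-item correct/incorrect judgment is assumed here.

```python
import math
import statistics


def mean_opinion_score(ratings):
    """Return (mean, 95% CI half-width) for a list of 1-5 listener ratings."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    mean = statistics.fmean(ratings)
    # Normal-approximation interval using the sample standard deviation.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


def pronunciation_accuracy(judgments):
    """Fraction of items judged correctly pronounced (list of booleans)."""
    if not judgments:
        raise ValueError("no judgments")
    return sum(judgments) / len(judgments)


# Hypothetical data for illustration only; not results from the paper.
ratings = [4, 5, 4, 4, 5, 4, 3, 5]
mos, ci = mean_opinion_score(ratings)
acc = pronunciation_accuracy([True] * 48 + [False] * 2)
print(f"MOS = {mos:.2f} +/- {ci:.2f}, accuracy = {acc:.1%}")
```

Reporting the interval alongside the mean helps when comparing two systems whose scores are close, as with the 4.28 and 4.35 figures above.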
Abstract
Tibetan text-to-speech (TTS) has long been challenged by scarce speech resources, significant dialectal variation, and the complex mapping between written text and spoken pronunciation. To address these issues, this work presents, to the best of our knowledge, the first large-model-based Tibetan TTS system in the industry, built upon a large speech synthesis model developed by Xingchen AGI Lab. The proposed system integrates data quality enhancement, Tibetan-oriented text representation and tokenizer adaptation, and cross-lingual adaptive training for low-resource Tibetan speech synthesis. Experimental results show that the system can generate stable, natural, and intelligible Tibetan speech under low-resource conditions. In subjective evaluation, the MOS scores of the syllable-level and BPE-based systems reach 4.28 and 4.35, while their pronunciation accuracies reach 97.6% and 96.6%,…
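To illustrate what "syllable-level" text units can look like for Tibetan, here is a small sketch that splits text on the tsheg (U+0F0B), the mark that separates syllables in Tibetan script, and on the shad (U+0F0D), a clause-ending mark. This is an assumption about one plausible preprocessing step, not the authors' tokenizer, which the abstract does not describe; `split_syllables` is a hypothetical helper. A BPE-based system would instead learn subword merges from the training text.

```python
import re

TSHEG = "\u0f0b"  # Tibetan syllable delimiter
SHAD = "\u0f0d"   # Tibetan clause/sentence mark


def split_syllables(text):
    """Split Tibetan text into syllables on tsheg, shad, and whitespace."""
    parts = re.split(f"[{TSHEG}{SHAD}\\s]+", text)
    # Drop empty strings produced by leading or trailing delimiters.
    return [p for p in parts if p]


# "bod skad" (Tibetan language), written with escapes to avoid encoding issues.
sample = "\u0f56\u0f7c\u0f51\u0f0b\u0f66\u0f90\u0f51\u0f0b"
syllables = split_syllables(sample)
print(len(syllables), syllables)
```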
