Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation
Xinhan Di, Zihao Chen, Yunming Liang, Junjie Zheng, Yihua Wang, Chaofan Ding

TL;DR
Bailing-TTS is a large-scale Chinese dialectal speech synthesis model that uses semi-supervised learning and a specialized transformer architecture to generate high-quality, human-like spontaneous speech.
Contribution
The paper introduces a continual semi-supervised training framework for aligning text tokens with speech tokens, and a transformer-based architecture with multi-stage training designed specifically for Chinese dialectal speech synthesis.
Findings
Achieves high-quality dialectal speech generation
Demonstrates effective alignment of text and speech tokens
Produces human-like spontaneous speech results
Abstract
Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, the Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed design of novel network architecture and corresponding strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards…
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis
