Bailing-TTS: Chinese Dialectal Speech Synthesis Towards Human-like Spontaneous Representation

Xinhan Di; Zihao Chen; Yunming Liang; Junjie Zheng; Yihua Wang; Chaofan Ding

arXiv:2408.00284 · cs.CL · August 2, 2024

TL;DR

Bailing-TTS is a large-scale Chinese dialectal speech synthesis model that uses semi-supervised learning and a specialized transformer architecture to generate high-quality, human-like spontaneous speech.

Contribution

It introduces a novel semi-supervised training framework and a transformer-based architecture specifically designed for Chinese dialectal speech synthesis.

Findings

01. Achieves high-quality dialectal speech generation

02. Demonstrates effective alignment of text and speech tokens

03. Produces human-like spontaneous speech results

Abstract

Large-scale text-to-speech (TTS) models have made significant progress recently. However, they still fall short in the generation of Chinese dialectal speech. To address this, we propose Bailing-TTS, a family of large-scale TTS models capable of generating high-quality Chinese dialectal speech. Bailing-TTS serves as a foundation model for Chinese dialectal speech generation. First, continual semi-supervised learning is proposed to facilitate the alignment of text tokens and speech tokens. Second, Chinese dialectal representation learning is developed using a specific transformer architecture and multi-stage training processes. With the proposed novel network architecture and the corresponding training strategy, Bailing-TTS is able to generate Chinese dialectal speech from text effectively and efficiently. Experiments demonstrate that Bailing-TTS generates Chinese dialectal speech towards…

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis