TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models
Shengpeng Ji, Jialong Zuo, Minghui Fang, Ziyue Jiang, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao

TL;DR
This paper introduces TextrolSpeech, a large-scale dataset with natural text style prompts for controllable TTS, and proposes Salle, an architecture that uses codec codes for diverse style speech generation, advancing natural text-driven speech synthesis.
Contribution
The paper presents the first large-scale dataset with natural text style prompts and introduces Salle, a novel architecture that enhances style diversity in controllable TTS using codec codes.
Findings
TextrolSpeech contains 236,220 style prompt-speech pairs.
Salle achieves comparable performance to traditional models in controllable TTS.
The multi-stage prompt programming effectively utilizes GPT for style description generation.
Abstract
Recently, there has been a growing interest in the field of controllable Text-to-Speech (TTS). While previous studies have relied on users providing specific style factor values based on acoustic knowledge or selecting reference speeches that meet certain requirements, generating speech solely from natural text prompts has emerged as a new challenge for researchers. This challenge arises due to the scarcity of high-quality speech datasets with natural text style prompt and the absence of advanced text-controllable TTS models. In light of this, 1) we propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes. The dataset comprises 236,220 pairs of style prompt in natural text descriptions with five style factors and corresponding speech samples. Through iterative experimentation, we introduce a multi-stage prompt programming approach…
Taxonomy
Topics: Music and Audio Processing · Speech Recognition and Synthesis · Speech and Audio Processing
Methods: Multi-Head Attention · Attention Is All You Need · Cosine Annealing · Weight Decay · Linear Layer · Attention Dropout · Softmax · Dense Connections · Residual Connection
