PromptTTS 2: Describing and Generating Voices with Text Prompt
Yichong Leng, Zhifang Guo, Kai Shen, Xu Tan, Zeqian Ju, Yanqing Liu, Yufei Liu, Dongchao Yang, Leying Zhang, Kaitao Song, Lei He, Xiang-Yang Li, Sheng Zhao, Tao Qin, Jiang Bian

TL;DR
PromptTTS 2 introduces a novel approach combining a variation network and LLM-based prompt generation to improve voice diversity and quality in text prompt-based speech synthesis, reducing data labeling costs.
Contribution
It presents a new framework that enhances voice variability and prompt quality in TTS using a variation network and large language models, addressing key limitations of previous text prompt methods.
Findings
More consistent voice generation with text prompts
Supports diverse voice variability sampling
Reduces large-scale data labeling costs
Abstract
Speech conveys more information than text, as the same word can be uttered in various voices to convey diverse information. Compared to traditional text-to-speech (TTS) methods relying on speech prompts (reference speech) for voice variability, using text prompts (descriptions) is more user-friendly since speech prompts can be hard to find or may not exist at all. TTS approaches based on the text prompt face two main challenges: 1) the one-to-many problem, where not all details about voice variability can be described in the text prompt, and 2) the limited availability of text prompt datasets, where vendors and large cost of data labeling are required to write text prompts for speech. In this work, we introduce PromptTTS 2 to address these challenges with a variation network to provide variability information of voice not captured by text prompts, and a prompt generation pipeline to…
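The one-to-many problem described in the abstract can be illustrated with a toy sketch: one fixed prompt representation, combined with different Gaussian noise draws, yields different voice representations. This is not the paper's variation network; the function name, the embedding values, and the additive-noise rule are all assumptions made for illustration.

```python
import random

def sample_voice_representation(prompt_embedding, seed, noise_scale=0.5):
    """Toy stand-in for a variation network (hypothetical, not the paper's model).

    Adds seeded Gaussian noise to a fixed prompt embedding, so one prompt
    can map to many different voice representations.
    """
    rng = random.Random(seed)
    return [x + noise_scale * rng.gauss(0.0, 1.0) for x in prompt_embedding]

# Hypothetical encoding of a single text prompt (values are made up).
prompt_embedding = [0.2, -1.0, 0.7, 0.0]

voice_a = sample_voice_representation(prompt_embedding, seed=1)
voice_b = sample_voice_representation(prompt_embedding, seed=2)
voice_a_again = sample_voice_representation(prompt_embedding, seed=1)

# Same prompt, different noise: different voices. Same noise: same voice.
print(voice_a != voice_b, voice_a == voice_a_again)
```

Fixing the seed reproduces a voice, while changing it explores the variability that the text prompt leaves unspecified.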
Peer Reviews
Decision·ICLR 2024 poster
1. This paper studies an interesting problem: creating voices from text descriptions. This line of research has great potential to make speech generation more customizable. 2. The authors present a systematic pipeline to produce text describing four aspects of speech, addressing the data scarcity problem. The ablation study in Table 7 shows the benefit of the step-by-step generation process. 3. The variation model tackles the one-to-many problem. The author verified that when
1. I am not certain whether the proposed model and the baseline models are trained on the same data, so I cannot conclude whether the proposed model outperforms the baselines because of the additional LLM-generated data or because of the variation network introduced to address the one-to-many problem. It would be good to show how well the baseline performs with and without LLM-augmented text prompts. 2. Given that the number of attribute combinations is rather small (2
The claims are tested, and the proposed model appears to add considerable variability, as intended.
So many details are left out, as this would not really be possible to fit into the paper. So without the exact code + recipe to produce all the results, it will be almost impossible to reproduce the results. I think having code + recipe available here is very important. All the experiments basically just show that the proposed model works well and solves the outlined problem. However, there is almost no analysis or ablation studies, etc. E.g. how important is it to use a diffusion model here? W
The proposed modeling and data labeling pipelines for text-prompt-based TTS systems can generate higher-quality speech with more consistent and noticeable control than previous systems. The variation network predicts speech representations that correspond more closely to the text prompt, and it provides more diversity by sampling from Gaussian noise. On the other hand, the LLM-based prompt generation pipeline can produce high-quality text prompts at scale and can easily incorporate new attribut
After listening to the generated voices on the demo page, audio quality is still an issue and further improvements are required, especially for certain text prompts such as "Please speak at a fast speed, gentleman". The reason could be missing or few audio samples for corresponding prompts in training datasets.
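The reviews above refer to a pipeline that writes text prompts from attribute labels. As a rough sketch of that general idea only, the example below composes a prompt from labels using hard-coded keyword lists and sentence templates in place of an LLM. The attribute names, keywords, templates, and the `compose_prompt` helper are all hypothetical and are not taken from the paper.

```python
import random

# Hypothetical keyword lists per attribute value (an LLM would write these).
KEYWORDS = {
    "gender": {"male": ["gentleman", "man"], "female": ["lady", "woman"]},
    "speed": {
        "fast": ["at a fast speed", "quickly"],
        "slow": ["at a slow speed", "slowly"],
    },
}

# Hypothetical sentence templates with one slot per attribute.
TEMPLATES = [
    "Please speak {speed}, {gender}.",
    "I would like a {gender} who talks {speed}.",
]

def compose_prompt(labels, rng):
    """Pick one keyword per labelled attribute and fill a random template."""
    slots = {attr: rng.choice(KEYWORDS[attr][value]) for attr, value in labels.items()}
    return rng.choice(TEMPLATES).format(**slots)

rng = random.Random(0)
prompt = compose_prompt({"gender": "male", "speed": "fast"}, rng)
print(prompt)
```

Because keywords and templates are sampled independently, a small label set can yield many distinct prompts, and adding a new attribute only requires a new keyword list and template slot.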
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
