Controllable Text-to-Speech Synthesis with Masked-Autoencoded Style-Rich Representation
Yongqi Wang, Chunlei Zhang, Hangting Chen, Zhou Zhao, Dong Yu

TL;DR
This paper introduces a two-stage controllable text-to-speech system that uses a masked-autoencoded style-rich representation to improve fine-grained control over speech attributes and speaker characteristics.
Contribution
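It presents a two-stage TTS framework that uses a quantized masked-autoencoded style-rich representation as the intermediary between stages. Training the first-stage model on extensive datasets improves both content robustness and control over multiple attributes.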
Findings
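Training the first-stage model on extensive datasets improves control over multiple speech attributes.
The same training enhances the content robustness of the two-stage model.
Selectively combining discrete labels and speaker embeddings is explored as a way to control the speaker's timbre and other stylistic information explicitly.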
Abstract
Controllable TTS models with natural language prompts often lack the ability for fine-grained control and face a scarcity of high-quality data. We propose a two-stage style-controllable TTS system with language models, utilizing a quantized masked-autoencoded style-rich representation as an intermediary. In the first stage, an autoregressive transformer is used for the conditional generation of these style-rich tokens from text and control signals. The second stage generates codec tokens from both text and sampled style-rich tokens. Experiments show that training the first-stage model on extensive datasets enhances the content robustness of the two-stage model as well as control capabilities over multiple attributes. By selectively combining discrete labels and speaker embeddings, we explore fully controlling the speaker's timbre and other stylistic information, and adjusting attributes…
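The two-stage data flow described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the vocabulary sizes, the `EOS` id, the helper names (`autoregressive_decode`, `make_toy_sampler`, `synthesize`), and the choice to condition by simple concatenation are all assumptions made for the example. The abstract states that the first stage is autoregressive; reusing the same decoding loop for the second stage is likewise an assumption. The "models" here are random samplers standing in for trained transformers.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical sizes and end-of-sequence id; none of these come from the paper.
STYLE_VOCAB = 16
CODEC_VOCAB = 32
EOS = 0

# A "model" here is any function mapping (condition, prefix) to the next token id.
NextToken = Callable[[Sequence[int], Sequence[int]], int]


def autoregressive_decode(next_token: NextToken,
                          condition: Sequence[int],
                          max_len: int) -> List[int]:
    """Left-to-right decoding: emit one token at a time until EOS or max_len."""
    out: List[int] = []
    for _ in range(max_len):
        tok = next_token(condition, out)
        if tok == EOS:
            break
        out.append(tok)
    return out


def make_toy_sampler(vocab: int, rng: random.Random) -> NextToken:
    """Stand-in for a trained model: ignores its inputs and samples uniformly."""
    def sample(condition: Sequence[int], prefix: Sequence[int]) -> int:
        return rng.randrange(vocab)
    return sample


def synthesize(text_ids: Sequence[int],
               control_ids: Sequence[int],
               stage1: NextToken,
               stage2: NextToken,
               max_style_len: int = 20,
               max_codec_len: int = 40) -> Tuple[List[int], List[int]]:
    # Stage 1: style-rich tokens conditioned on text and control signals.
    style_tokens = autoregressive_decode(
        stage1, list(text_ids) + list(control_ids), max_style_len)
    # Stage 2: codec tokens conditioned on text and the sampled style-rich tokens.
    codec_tokens = autoregressive_decode(
        stage2, list(text_ids) + style_tokens, max_codec_len)
    return style_tokens, codec_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    style, codec = synthesize(
        text_ids=[5, 6, 7],
        control_ids=[1, 2],
        stage1=make_toy_sampler(STYLE_VOCAB, rng),
        stage2=make_toy_sampler(CODEC_VOCAB, rng),
    )
    print(len(style), "style tokens;", len(codec), "codec tokens")
```

The point of the sketch is the interface between stages: control signals enter only in the first stage, and the second stage sees them only indirectly through the sampled style-rich tokens.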
Taxonomy
Topics: Speech and dialogue systems · Natural Language Processing Techniques · Speech Recognition and Synthesis
