SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer
Daegyeom Kim, Seongho Hong, and Yong-Hoon Choi

TL;DR
This paper introduces SC VALL-E, a neural codec language model for controllable, expressive zero-shot text-to-speech synthesis. It manipulates style attributes such as emotion, speaking rate, pitch, and voice intensity, and reports quality competitive with existing models.
Contribution
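The paper adds a style network to VALL-E. Tokens in its style embedding matrix correspond to attributes such as emotion, speaking rate, pitch, and voice intensity, so these attributes can be manipulated explicitly in zero-shot synthesis instead of only being copied from the prompt audio.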
Findings
SC VALL-E achieves competitive quality in expressive speech synthesis.
The model effectively controls style attributes such as emotion and pitch.
It outperforms baseline models in style control and speech quality metrics.
Abstract
Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset, in order to control various characteristics of speech and generate the desired voice. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model (called VALL-E), which follows the structure of the generative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes input from text sentences and prompt audio and is designed to generate controllable speech by not simply mimicking the characteristics of the prompt audio but by controlling the attributes to produce diverse voices. We identify tokens in the style embedding matrix of the newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these…
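The idea of controlling speech through tokens in a style embedding matrix can be illustrated with a minimal sketch. It assumes a mechanism in the spirit of global style tokens: a prompt encoding attends over a small bank of style tokens, and one token's weight is rescaled by hand before the tokens are mixed. The abstract does not say how SC VALL-E's style network computes or edits its tokens, so the function `style_embedding`, the `scale` argument, the token count, and the dimensions are all hypothetical, and random arrays stand in for trained parameters.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def style_embedding(query, tokens, scale=None):
    """Mix style tokens into one style vector (hypothetical helper).

    query:  (d,) encoding of the prompt audio (assumed, not from the paper)
    tokens: (n_tokens, d) style embedding matrix
    scale:  optional {token_index: factor} to strengthen or weaken a token
    """
    # Scaled dot-product attention scores between the query and each token.
    scores = tokens @ query / np.sqrt(tokens.shape[1])
    weights = softmax(scores)
    # Manual control step (an assumption): rescale selected token weights.
    if scale:
        for idx, factor in scale.items():
            weights[idx] *= factor
    return weights @ tokens, weights


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16))  # stand-in for 8 trained style tokens
query = rng.standard_normal(16)        # stand-in for a prompt-audio encoding

base, w_base = style_embedding(query, tokens)
boosted, w_boost = style_embedding(query, tokens, scale={3: 2.0})
```

In this sketch, doubling the weight of token 3 shifts the style vector by exactly `w_base[3] * tokens[3]`. In a trained model, that token would ideally correspond to a single attribute such as pitch or speaking rate.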
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
