SC VALL-E: Style-Controllable Zero-Shot Text to Speech Synthesizer

Daegyeom Kim, Seongho Hong, and Yong-Hoon Choi

arXiv:2307.10550 · cs.SD · July 21, 2023

TL;DR

This paper introduces SC VALL-E, a neural codec language model for controllable, expressive zero-shot text-to-speech synthesis. Rather than simply mimicking the prompt audio, it lets the user adjust style attributes such as emotion, speaking rate, pitch, and voice intensity.

Contribution

The paper adds a style network to VALL-E. Tokens in its style embedding matrix correspond to speech attributes, which allows those attributes to be manipulated explicitly in zero-shot synthesis instead of being copied from the prompt audio.

Findings

01. SC VALL-E achieves competitive quality in expressive speech synthesis.

02. The model effectively controls style attributes such as emotion and pitch.

03. It outperforms baseline models in style control and speech quality metrics.

Abstract

Expressive speech synthesis models are trained by adding corpora with diverse speakers, various emotions, and different speaking styles to the dataset, in order to control various characteristics of speech and generate the desired voice. In this paper, we propose a style control (SC) VALL-E model based on the neural codec language model (called VALL-E), which follows the structure of the generative pretrained transformer 3 (GPT-3). The proposed SC VALL-E takes input from text sentences and prompt audio and is designed to generate controllable speech by not simply mimicking the characteristics of the prompt audio but by controlling the attributes to produce diverse voices. We identify tokens in the style embedding matrix of the newly designed style network that represent attributes such as emotion, speaking rate, pitch, and voice intensity, and design a model that can control these…
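
The style-token idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the token count, embedding size, prompt scores, and the helper names `style_vector` and `control_style` are all hypothetical, and the softmax-weighted sum is an assumed way of combining tokens. It only shows how scaling the weight of one token in a style embedding matrix would shift the resulting style vector.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_vector(weights, style_tokens):
    """Weighted sum of the rows of the style embedding matrix."""
    dim = len(style_tokens[0])
    return [
        sum(w * tok[d] for w, tok in zip(weights, style_tokens))
        for d in range(dim)
    ]


def control_style(weights, token_index, scale):
    """Return a copy of the weights with one token's weight rescaled."""
    adjusted = list(weights)
    adjusted[token_index] *= scale
    return adjusted


random.seed(0)
NUM_TOKENS, DIM = 4, 8  # hypothetical sizes, not taken from the paper

# Hypothetical style embedding matrix: one row per style token.
style_tokens = [
    [random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)
]

# Hypothetical per-token scores that a style network might derive from prompt audio.
prompt_scores = [0.2, 1.5, -0.3, 0.1]
weights = softmax(prompt_scores)

baseline = style_vector(weights, style_tokens)
# Double the weight of token 1, assumed here to stand for one attribute (e.g. pitch).
boosted = style_vector(control_style(weights, 1, 2.0), style_tokens)
```

In this sketch, `boosted` differs from `baseline` by exactly `weights[1]` times the second token's embedding, so the change in style is confined to the direction of the token being controlled.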

Code & Models

Repositories

0913ktg/sc_vall-e
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing