Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang; Sanyuan Chen; Yu Wu; Ziqiang Zhang; Long Zhou; Shujie Liu; Zhuo Chen; Yanqing Liu; Huaming Wang; Jinyu Li; Lei He; Sheng Zhao; Furu Wei

arXiv:2301.02111·cs.CL·January 6, 2023·161 cites

TL;DR

This paper presents Vall-E, a neural codec language model for zero-shot text-to-speech synthesis that leverages large-scale training data and in-context learning to produce natural, personalized speech with minimal prompts.

Contribution

The paper introduces Vall-E, a novel neural codec language model that achieves high-quality zero-shot TTS by treating synthesis as a conditional language modeling task and scaling up training data.

Findings

01. Vall-E outperforms state-of-the-art zero-shot TTS systems in naturalness and speaker similarity.
02. It can synthesize personalized speech using only a 3-second speaker prompt.
03. Vall-E preserves speaker emotion and acoustic environment in synthesis.

Abstract

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find Vall-E could preserve the…
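To make the "conditional language modeling over discrete codes" framing concrete, here is a minimal sketch, not the paper's implementation. It assumes a single stream of codec codes, a hypothetical codebook size of 1024 with an extra end-of-sequence id, and a random stand-in function `toy_logits` in place of a trained model. The names `synthesize_codes`, `toy_logits`, and `sample` are illustrative only. The point it shows: the text tokens and the codes of the short acoustic prompt form the context, and new codes are drawn one at a time from a categorical distribution instead of being regressed as continuous values.

```python
import math
import random
from typing import Callable, List, Sequence

CODEBOOK_SIZE = 1024  # assumed number of discrete codec codes (illustrative)
EOS = CODEBOOK_SIZE   # assumed end-of-sequence id placed after the codes

LogitsFn = Callable[[Sequence[int], Sequence[int]], List[float]]


def toy_logits(text_tokens: Sequence[int], codes: Sequence[int]) -> List[float]:
    """Hypothetical stand-in for a trained model.

    Returns one pseudo-random logit per codec code plus one for EOS,
    seeded by the context so that results are reproducible.
    """
    seed = 7919 * len(codes) + sum(text_tokens) + sum(codes)
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(CODEBOOK_SIZE + 1)]


def sample(logits: Sequence[float], rng: random.Random) -> int:
    """Draw one id from the softmax distribution over the logits."""
    top = max(logits)
    weights = [math.exp(value - top) for value in logits]
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def synthesize_codes(
    text_tokens: Sequence[int],
    prompt_codes: Sequence[int],
    logits_fn: LogitsFn,
    max_new_codes: int = 50,
    seed: int = 0,
) -> List[int]:
    """Autoregressively extend the acoustic prompt, conditioned on the text."""
    rng = random.Random(seed)
    codes = list(prompt_codes)  # codes of the short enrolled recording
    for _ in range(max_new_codes):
        next_id = sample(logits_fn(text_tokens, codes), rng)
        if next_id == EOS:
            break
        codes.append(next_id)
    # Return only the newly generated codes; a codec decoder (not shown)
    # would turn them back into a waveform.
    return codes[len(prompt_codes):]


if __name__ == "__main__":
    new_codes = synthesize_codes([3, 1, 4], [10, 20, 30], toy_logits, max_new_codes=20)
    print(len(new_codes), new_codes[:5])
```

In a real system the stand-in would be replaced by a trained network and the returned codes would be passed to the codec's decoder; both are outside the scope of this sketch.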

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Speech and dialogue systems

Methods: Adam · LAMB