BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Mateusz Łajszczak; Guillermo Cámbara; Yang Li; Fatih Beyhan; Arent van Korlaar; Fan Yang; Arnaud Joly; Álvaro Martín-Cortinas; Ammar Abbas; Adam Michalski; Alexis Moinet; Sri Karlapati; Ewa Muszyńska; Haohan Guo; Bartosz Putrycz; Soledad López Gambino; Kayeon Yoo; Elena Sokolova; Thomas Drugman

arXiv:2402.08093·cs.LG·February 16, 2024·23 cites

TL;DR

BASE TTS is a large-scale, 1-billion-parameter text-to-speech model trained on 100K hours of data, achieving state-of-the-art naturalness and demonstrating emergent abilities like natural prosody on complex sentences.

Contribution

We introduce BASE TTS, the largest TTS model to date, built on a novel speech tokenization technique, and demonstrate emergent abilities in large-scale TTS models.

Findings

01. Achieved state-of-the-art speech naturalness.

02. Large models show emergent abilities like natural prosody.

03. Developed a new dataset to measure emergent TTS abilities.

Abstract

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+…
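The abstract mentions that speechcodes are compressed with byte-pair encoding. As a rough illustration of that idea only, the toy sketch below applies BPE merges to a sequence of integer codes and then expands them back. It is not the authors' implementation: the function names (`bpe_compress`, `bpe_expand`), the base vocabulary size, and the number of merges are all assumptions chosen for the example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most frequent adjacent pair, or None if no pair repeats."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None
    pair, count = pairs.most_common(1)[0]
    return pair if count > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(codes, base_vocab_size, num_merges):
    """Greedily merge frequent code pairs (hypothetical helper, toy scale).

    New ids start at `base_vocab_size` so they never collide with raw codes.
    Returns the shortened sequence and the merge table {new_id: (a, b)}.
    """
    seq, merges, next_id = list(codes), {}, base_vocab_size
    for _ in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        merges[next_id] = pair
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges, recovering the original code sequence."""
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in merges:
            a, b = merges[tok]
            stack.extend([b, a])  # push b first so a is expanded first
        else:
            out.append(tok)
    return out
```

Calling `bpe_compress(codes, base_vocab_size=8, num_merges=2)` on a short list of codes in the range 0 to 7 returns a shorter sequence plus a merge table, and `bpe_expand` recovers the input exactly. The point of the shorter sequence is that an autoregressive model has fewer steps to predict per utterance.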

Code & Models

Models

🤗 shb777/csm-maya-exp2 (model · ♡ 6)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Linear Layer · Balanced Selection · Dropout