BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo

TL;DR
BASE TTS is a large-scale, 1-billion-parameter text-to-speech model trained on 100K hours of data, achieving state-of-the-art naturalness and demonstrating emergent abilities like natural prosody on complex sentences.
Contribution
We introduce BASE TTS, the largest TTS model to date, built on a novel speech tokenization technique, and we demonstrate emergent abilities in large-scale TTS models.
Findings
Achieved state-of-the-art speech naturalness.
Large-scale TTS models show emergent abilities, such as natural prosody on complex sentences.
Developed a new dataset to measure emergent TTS abilities.
Abstract
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+…
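The abstract mentions compressing speechcodes with byte-pair encoding. As a rough illustration of that general idea only, the sketch below applies textbook BPE merges to a sequence of integer code IDs. It is not the authors' implementation: the function names (`bpe_compress`, `bpe_expand`), the toy input, and the choice to represent speechcodes as plain integers are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair in seq, or None if there is none."""
    pairs = Counter(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(codes, num_merges, first_new_id):
    """Hypothetical helper: run up to num_merges BPE merges over integer codes.

    first_new_id must be larger than any ID already present in codes.
    Returns the shortened sequence and the learned merge table.
    """
    seq = list(codes)
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:  # fewer than two tokens left
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Undo the merges, recovering the original code sequence."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out = []
    stack = list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)  # push right first so left is expanded first
            stack.append(left)
        else:
            out.append(tok)
    return out


# Toy sequence of code IDs (made up for illustration).
codes = [7, 7, 3, 7, 7, 3, 5]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_id=100)
restored = bpe_expand(compressed, merges)
```

In this toy setup, frequently repeated neighbouring codes collapse into single new IDs, so an autoregressive model would have fewer tokens to predict for the same audio. The sketch merges the most frequent pair even if it occurs only once; a practical tokenizer would typically learn merges over a large corpus and stop at a target vocabulary size.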
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Dense Connections · Label Smoothing · Absolute Position Encodings · Softmax · Byte Pair Encoding · Linear Layer · Balanced Selection · Dropout
