CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech

Jaehyeon Kim; Keon Lee; Seungjun Chung; Jaewoong Cho

arXiv:2404.02781·eess.AS·April 4, 2024·2 cites

TL;DR

CLaM-TTS introduces a probabilistic residual vector quantization approach to improve neural codec-based zero-shot TTS, reducing token sequence length and enabling efficient multi-token generation, achieving competitive naturalness and speed.

Contribution

The paper presents a probabilistic residual vector quantization method that improves token-length compression and enables multi-token generation in neural codec-based TTS models.

Findings

01. Matches or outperforms state-of-the-art neural codec-based TTS models in naturalness and intelligibility.

02. Reduces inference time compared to existing models.

03. Shows the impact of pretraining and tokenization strategies on performance.

Abstract

With the emergence of neural audio codecs, which encode multiple streams of discrete tokens from audio, large language models have recently gained attention as a promising approach for zero-shot Text-to-Speech (TTS) synthesis. Despite the ongoing rush towards scaling paradigms, audio tokenization ironically amplifies the scalability challenge, stemming from its long sequence length and the complexity of modelling the multiple sequences. To mitigate these issues, we present CLaM-TTS that employs a probabilistic residual vector quantization to (1) achieve superior compression in the token length, and (2) allow a language model to generate multiple tokens at once, thereby eliminating the need for cascaded modeling to handle the number of token streams. Our experimental results demonstrate that CLaM-TTS is better than or comparable to state-of-the-art neural codec-based TTS models regarding…
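To make the "multiple streams of discrete tokens" concrete, the following is a minimal sketch of plain residual vector quantization (RVQ), the general technique the abstract builds on. It is not the paper's probabilistic variant, and it is not taken from the authors' code. The function names, the toy codebooks, and the 2-dimensional latent frame are all hypothetical choices made for illustration.

```python
import random


def nearest(codebook, vec):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((c - v) ** 2 for c, v in zip(code, vec))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        # Subtract the chosen code so the next level sees what is left over.
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tokens


def rvq_decode(tokens, codebooks):
    """Reconstruct a vector by summing the selected code from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Hypothetical toy setup: 3 quantizer levels, 4 codes each, 2-dim latents.
rng = random.Random(0)
codebooks = [
    [[rng.uniform(-1, 1) * scale for _ in range(2)] for _ in range(4)]
    for scale in (1.0, 0.5, 0.25)
]
frame = [0.3, -0.7]                    # one latent audio frame (made up)
tokens = rvq_encode(frame, codebooks)  # one token per level for this frame
recon = rvq_decode(tokens, codebooks)
print(tokens, recon)
```

In this sketch a single frame yields one token per quantizer level, so an utterance of T frames becomes several parallel token sequences of length T. That is the "number of token streams" the abstract refers to when it describes generating multiple tokens at once instead of using cascaded modeling.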

Taxonomy

Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling