RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Detai Xin; Xu Tan; Kai Shen; Zeqian Ju; Dongchao Yang; Yuancheng Wang; Shinnosuke Takamichi; Hiroshi Saruwatari; Shujie Liu; Jinyu Li; Sheng Zhao

arXiv:2404.03204 · eess.AS · May 21, 2024 · 2 cites

Open Access

TL;DR

RALL-E introduces a chain-of-thought prompting approach for text-to-speech synthesis, significantly improving robustness and reducing error rates compared to previous LLM-based methods.

Contribution

It proposes a novel chain-of-thought prompting framework that decomposes TTS into simpler steps, enhancing robustness and accuracy in zero-shot speech synthesis.

Findings

01. Significantly reduces word error rate in zero-shot TTS

02. Improves synthesis robustness for difficult sentences

03. Outperforms baseline VALL-E in objective and subjective evaluations

Abstract

We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of language models. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the…
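
The abstract is cut off before it states what the duration prompt steers attention toward, so the following is only a sketch of one plausible reading, not the authors' method. It assumes that each speech frame is restricted to the phoneme its predicted duration aligns it with, plus a few neighbours. The function names and the `window` parameter are hypothetical and do not come from the paper.

```python
from typing import List


def frame_to_phoneme(durations: List[int]) -> List[int]:
    """Map each speech frame to the index of the phoneme it falls in."""
    return [idx for idx, dur in enumerate(durations) for _ in range(dur)]


def duration_guided_mask(durations: List[int], window: int = 1) -> List[List[bool]]:
    """Return mask[t][p] == True when frame t may attend to phoneme p.

    `window` (hypothetical) is the number of neighbouring phonemes on each
    side of the aligned phoneme that remain visible to the frame.
    """
    alignment = frame_to_phoneme(durations)
    n_phonemes = len(durations)
    return [
        [abs(p - center) <= window for p in range(n_phonemes)]
        for center in alignment
    ]


# Assumed input: frames per phoneme, as a duration predictor might emit.
durations = [2, 1, 3]
mask = duration_guided_mask(durations, window=1)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

In a Transformer, a boolean mask of this shape would typically be applied to the attention logits before the softmax so that masked positions receive zero weight. Whether RALL-E uses a fixed window or some other rule is not recoverable from the truncated abstract.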

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Multi-Head Attention · Adam · Byte Pair Encoding · Absolute Position Encodings · Softmax · Dense Connections · Label Smoothing