ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering
Yakun Song, Zhuo Chen, Xiaofei Wang, Ziyang Ma, Xie Chen

TL;DR
ELLA-V introduces an alignment-guided sequence reordering method for neural codec language modeling, making zero-shot text-to-speech synthesis more stable, more controllable, and more accurate than earlier models such as VALL-E.
Contribution
The paper proposes a novel sequence interleaving approach that aligns acoustic and phoneme tokens, enabling fine-grained control and more stable speech synthesis in neural codec language models.
Findings
Outperforms VALL-E in synthesis accuracy.
Provides more stable synthesis under both greedy and sampling-based decoding.
Enables phoneme-level control over generated speech.
Abstract
The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) difficulty achieving fine-grained control over the synthesized speech with autoregressive (AR) language models; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead…
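The interleaving idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a precomputed phoneme-level alignment given as a frame count per phoneme, assumes one acoustic token per frame, and assumes each phoneme token is placed immediately before the acoustic tokens aligned to it. The function name `interleave` and the toy token values are hypothetical.

```python
from typing import List, Sequence, Union

Token = Union[str, int]


def interleave(
    phonemes: Sequence[str],
    durations: Sequence[int],
    acoustic: Sequence[int],
) -> List[Token]:
    """Build one sequence in which each phoneme precedes its aligned acoustic tokens.

    Assumptions (not taken from the paper): durations[i] is the number of
    acoustic frames aligned to phonemes[i], and there is one acoustic token
    per frame.
    """
    if len(phonemes) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    if any(n < 0 for n in durations):
        raise ValueError("durations must be non-negative")
    if sum(durations) != len(acoustic):
        raise ValueError("durations must cover every acoustic token")

    sequence: List[Token] = []
    position = 0
    for phoneme, n_frames in zip(phonemes, durations):
        sequence.append(phoneme)  # phoneme token comes first
        sequence.extend(acoustic[position:position + n_frames])  # then its frames
        position += n_frames
    return sequence


# Toy example: two phonemes aligned to 2 and 3 acoustic frames.
example = interleave(["HH", "AY"], [2, 3], [11, 12, 21, 22, 23])
print(example)
```

Because every acoustic token in such a sequence sits directly after the phoneme it belongs to, a model trained on it sees an explicit local alignment between the two token types, which is the kind of constraint the abstract says earlier methods lack.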
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing
