ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering

Yakun Song; Zhuo Chen; Xiaofei Wang; Ziyang Ma; Xie Chen

arXiv:2401.07333 · cs.CL · January 17, 2024 · 2 cites

TL;DR

ELLA-V introduces an alignment-guided sequence reordering method for neural codec language modeling, improving zero-shot text-to-speech synthesis by enhancing stability, control, and accuracy over previous models like VALL-E.

Contribution

The paper proposes a novel sequence interleaving approach that aligns acoustic and phoneme tokens, enabling fine-grained control and more stable speech synthesis in neural codec language models.

Findings

01. Outperforms VALL-E in accuracy.
02. Provides more stable synthesis with greedy and sampling decoding.
03. Enables phoneme-level control over generated speech.

Abstract

The language model (LM) approach based on acoustic and linguistic prompts, such as VALL-E, has achieved remarkable progress in the field of zero-shot audio generation. However, existing methods still have some limitations: 1) repetitions, transpositions, and omissions in the synthesized speech due to limited alignment constraints between audio and phoneme tokens; 2) challenges of fine-grained control over the synthesized speech with an autoregressive (AR) language model; 3) infinite silence generation due to the nature of AR-based decoding, especially under the greedy strategy. To alleviate these issues, we propose ELLA-V, a simple but efficient LM-based zero-shot text-to-speech (TTS) framework, which enables fine-grained control over synthesized audio at the phoneme level. The key to ELLA-V is interleaving sequences of acoustic and phoneme tokens, where phoneme tokens appear ahead…
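
The interleaving idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `interleave_tokens`, the use of per-phoneme frame counts as the alignment, and the choice to place each phoneme token directly before the acoustic tokens aligned to it are all assumptions made for illustration, since the abstract above is truncated before it states the exact layout.

```python
from typing import List, Sequence, Union

Token = Union[str, int]


def interleave_tokens(
    phonemes: Sequence[str],
    frame_counts: Sequence[int],
    acoustic: Sequence[int],
) -> List[Token]:
    """Build one sequence in which each phoneme token precedes its acoustic tokens.

    Hypothetical helper, not taken from the paper. `frame_counts[i]` is assumed
    to be the number of acoustic tokens aligned to `phonemes[i]`, for example
    as produced by a forced aligner.
    """
    if len(phonemes) != len(frame_counts):
        raise ValueError("need exactly one frame count per phoneme")
    if any(n < 0 for n in frame_counts):
        raise ValueError("frame counts must be non-negative")
    if sum(frame_counts) != len(acoustic):
        raise ValueError("frame counts must cover every acoustic token")

    sequence: List[Token] = []
    position = 0
    for phoneme, count in zip(phonemes, frame_counts):
        # The phoneme token goes first, then the acoustic tokens aligned to it.
        sequence.append(phoneme)
        sequence.extend(acoustic[position:position + count])
        position += count
    return sequence


if __name__ == "__main__":
    # Toy input: two phonemes aligned to 2 and 3 codec frames respectively.
    print(interleave_tokens(["HH", "AY"], [2, 3], [11, 12, 13, 14, 15]))
```

With the toy input, the phoneme `HH` is followed by its two acoustic tokens and `AY` by its three, so every acoustic token sits next to the phoneme it realizes. A layout of this kind is one way a model could be given the alignment constraint that the abstract says is missing in earlier approaches.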

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing