Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling
Ziqiang Zhang, Long Zhou, Chengyi Wang, Sanyuan Chen, Yu Wu, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei

TL;DR
VALL-E X is a multi-lingual neural codec language model that enables zero-shot cross-lingual speech synthesis and speech-to-speech translation. From a single source-language utterance used as a prompt, it produces high-quality target-language speech that preserves the speaker's voice and emotion.
Contribution
It extends VALL-E to a multi-lingual setting, enabling zero-shot cross-lingual speech synthesis and translation with controllable accents and preserved speaker characteristics.
Findings
High-quality cross-lingual speech synthesis achieved from a single source utterance.
Effective reduction of foreign accent issues through language ID control.
Preservation of speaker voice, emotion, and environment in generated speech.
Abstract
We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available…
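The prompting scheme described in the abstract can be illustrated with a minimal sketch. It only shows how the three conditioning inputs named there (a language ID, target-language text, and acoustic tokens from the source-language speech) might be bundled before being passed to a codec language model. All names (`CrossLingualPrompt`, `build_prompt`, `flatten`) and the token layout are hypothetical illustrations, not the authors' implementation, and no model is invoked.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CrossLingualPrompt:
    """Hypothetical container for the conditioning inputs named in the abstract."""
    language_id: str            # target language tag, assumed to steer the accent
    text_tokens: List[str]      # target-language text (e.g. phonemes)
    acoustic_prompt: List[int]  # codec tokens from the source-language utterance


def build_prompt(target_text_tokens: List[str],
                 source_acoustic_tokens: List[int],
                 target_language: str) -> CrossLingualPrompt:
    """Bundle target text and source speech tokens with a language ID."""
    if not target_text_tokens:
        raise ValueError("target text must not be empty")
    return CrossLingualPrompt(
        language_id=target_language,
        text_tokens=list(target_text_tokens),
        acoustic_prompt=list(source_acoustic_tokens),
    )


def flatten(prompt: CrossLingualPrompt) -> List[str]:
    """Serialize the prompt into one sequence (assumed layout, for illustration)."""
    return (
        [f"<lang:{prompt.language_id}>"]
        + prompt.text_tokens
        + ["<sep>"]
        + [f"<a:{t}>" for t in prompt.acoustic_prompt]
    )


if __name__ == "__main__":
    prompt = build_prompt(["n", "i", "h", "ao"], [17, 402, 88], "zh")
    print(flatten(prompt))
```

In this sketch, changing only `target_language` while keeping the same acoustic prompt corresponds to the accent control the abstract attributes to the language ID. How the real model embeds that ID is not specified in the text above.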
Taxonomy
Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques
