Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling

Ziqiang Zhang; Long Zhou; Chengyi Wang; Sanyuan Chen; Yu Wu; Shujie Liu; Zhuo Chen; Yanqing Liu; Huaming Wang; Jinyu Li; Lei He; Sheng Zhao; Furu Wei

arXiv:2303.03926 · cs.CL · March 8, 2023 · 25 cites

TL;DR

VALL-E X is a multi-lingual neural codec model enabling zero-shot cross-lingual speech synthesis and translation, producing high-quality, speaker- and emotion-preserving speech in target languages from minimal source speech prompts.

Contribution

It extends VALL-E to a multi-lingual setting, enabling zero-shot cross-lingual speech synthesis and translation with controllable accents and preserved speaker characteristics.

Findings

01. High-quality cross-lingual speech synthesis achieved from a single source utterance.

02. Effective reduction of foreign accent issues through language ID control.

03. Preservation of speaker voice, emotion, and environment in generated speech.

Abstract

We propose a cross-lingual neural codec language model, VALL-E X, for cross-lingual speech synthesis. Specifically, we extend VALL-E and train a multi-lingual conditional codec language model to predict the acoustic token sequences of the target language speech by using both the source language speech and the target language text as prompts. VALL-E X inherits strong in-context learning capabilities and can be applied for zero-shot cross-lingual text-to-speech synthesis and zero-shot speech-to-speech translation tasks. Experimental results show that it can generate high-quality speech in the target language via just one speech utterance in the source language as a prompt while preserving the unseen speaker's voice, emotion, and acoustic environment. Moreover, VALL-E X effectively alleviates the foreign accent problems, which can be controlled by a language ID. Audio samples are available…
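The conditioning described in the abstract can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Prompt`, `build_prompt`, `generate`, and `dummy_next_token` are hypothetical, the integer codec tokens are made up, and the "model" is a deterministic stand-in rather than a trained network. It only shows the shape of the idea: target-language text, a language ID, and the source speaker's codec tokens form the prompt, and new acoustic tokens are predicted one at a time after that prefix.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Prompt:
    """Hypothetical container for the conditioning inputs named in the abstract."""
    target_text: List[str]       # target-language text units (e.g. phonemes)
    language_id: str             # tag used to steer the accent of the output
    acoustic_prefix: List[int]   # codec tokens of the source-language utterance


def build_prompt(target_text: List[str],
                 source_codec_tokens: List[int],
                 language_id: str) -> Prompt:
    # Copy the inputs so later generation cannot mutate the caller's lists.
    return Prompt(list(target_text), language_id, list(source_codec_tokens))


def generate(prompt: Prompt,
             next_token: Callable[[Prompt, List[int]], int],
             max_new_tokens: int) -> List[int]:
    """Autoregressively extend the acoustic prefix and return only the new tokens."""
    tokens = list(prompt.acoustic_prefix)
    for _ in range(max_new_tokens):
        tokens.append(next_token(prompt, tokens))
    return tokens[len(prompt.acoustic_prefix):]


def dummy_next_token(prompt: Prompt, tokens: List[int]) -> int:
    # Stand-in for a trained codec language model (assumption): it simply
    # increments the last token within a made-up vocabulary of 1024 entries.
    return (tokens[-1] + 1) % 1024 if tokens else 0


prompt = build_prompt(["HH", "AH", "L", "OW"], [5, 6, 7], "en")
new_tokens = generate(prompt, dummy_next_token, max_new_tokens=3)
```

In a real system the returned tokens would be passed to a codec decoder to produce a waveform; that step is omitted here because it depends on the specific codec used.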

Code & Models

Repositories

plachtaa/vall-e-x
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Topic Modeling · Natural Language Processing Techniques