Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram

Shengkui Zhao; Hao Wang; Trung Hieu Nguyen; Bin Ma

arXiv:2102.01991 · cs.SD · February 4, 2021 · 1 citation

TL;DR

This paper introduces FastSpeech-VC, a cross-lingual voice conversion framework that uses phonetic posteriorgrams and prosodic features to produce natural, high-quality speech with a controllable speech rate, even with limited training data.

Contribution

The paper proposes a cross-lingual voice conversion (VC) method based on a neural TTS model and phonetic posteriorgrams (PPGs), improving naturalness, controllability, and adaptation with limited data.

Findings

1. Achieves mean opinion scores (MOS) close to those of professional recordings.
2. Runs inference faster than the baselines.
3. Adapts effectively with limited training data.

Abstract

Cross-lingual voice conversion (VC) is an important and challenging problem due to significant mismatches in the phonetic set and the speech prosody of different languages. In this paper, we build upon the neural text-to-speech (TTS) model, i.e., FastSpeech, and the LPCNet neural vocoder to design a new cross-lingual VC framework named FastSpeech-VC. We address the mismatches in the phonetic set and the speech prosody by applying Phonetic PosteriorGrams (PPGs), which have been shown to bridge across speaker and language boundaries. Moreover, we add normalized logarithm-scale fundamental frequency (Log-F0) to further compensate for the prosodic mismatches and significantly improve naturalness. Our experiments on English and Mandarin demonstrate that with only a mono-lingual corpus, the proposed FastSpeech-VC can achieve high-quality converted speech with mean opinion score (MOS)…
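
The abstract does not say how the Log-F0 feature is normalized. As a rough illustration only, the sketch below assumes per-utterance mean-variance normalization over voiced frames, which is a common choice for pitch features. The function name `normalized_log_f0` and the convention that unvoiced frames are marked with an F0 of 0 are assumptions made for this example, not the authors' implementation.

```python
import math


def normalized_log_f0(f0_hz):
    """Z-score normalize log-F0 over the voiced frames of one utterance.

    Assumption (not from the paper): unvoiced frames are given as F0 <= 0
    and are mapped to 0.0 in the output.
    """
    voiced_logs = [math.log(f) for f in f0_hz if f > 0]
    if not voiced_logs:
        # No voiced frames: nothing to normalize.
        return [0.0] * len(f0_hz)

    mean = sum(voiced_logs) / len(voiced_logs)
    var = sum((v - mean) ** 2 for v in voiced_logs) / len(voiced_logs)
    # Guard against division by (near) zero for a flat pitch contour.
    std = max(math.sqrt(var), 1e-8)

    return [(math.log(f) - mean) / std if f > 0 else 0.0 for f in f0_hz]


# Two contours with the same shape but different pitch registers.
low_voice = normalized_log_f0([0.0, 100.0, 200.0, 0.0])
high_voice = normalized_log_f0([0.0, 200.0, 400.0, 0.0])
```

Under this assumed scheme, scaling every F0 value by a constant (a higher or lower voice) only shifts the log-domain mean, so `low_voice` and `high_voice` come out the same up to rounding. That is one plausible reason a normalized Log-F0 feature can carry prosody across speakers, though the paper's exact formulation may differ.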

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques

Methods: Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Softmax · Label Smoothing · Dense Connections · Dropout · Byte Pair Encoding · Layer Normalization