Towards Natural and Controllable Cross-Lingual Voice Conversion Based on Neural TTS Model and Phonetic Posteriorgram
Shengkui Zhao, Hao Wang, Trung Hieu Nguyen, Bin Ma

TL;DR
This paper introduces FastSpeech-VC, a cross-lingual voice conversion framework that leverages phonetic posteriorgrams and prosodic features to produce natural, high-quality speech with controllable rate, even with limited training data.
Contribution
The paper proposes a novel cross-lingual VC method based on neural TTS and PPGs, improving naturalness, controllability, and adaptation with limited data.
Findings
Achieves high MOS scores close to professional recordings.
Faster inference speed compared to baselines.
Effective adaptation with limited training data.
Abstract
Cross-lingual voice conversion (VC) is an important and challenging problem due to significant mismatches of the phonetic set and the speech prosody of different languages. In this paper, we build upon the neural text-to-speech (TTS) model, i.e., FastSpeech, and LPCNet neural vocoder to design a new cross-lingual VC framework named FastSpeech-VC. We address the mismatches of the phonetic set and the speech prosody by applying Phonetic PosteriorGrams (PPGs), which have been proved to bridge across speaker and language boundaries. Moreover, we add normalized logarithm-scale fundamental frequency (Log-F0) to further compensate for the prosodic mismatches and significantly improve naturalness. Our experiments on English and Mandarin languages demonstrate that with only mono-lingual corpus, the proposed FastSpeech-VC can achieve high quality converted speech with mean opinion score (MOS)…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques
MethodsLinear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Attention Is All You Need · Softmax · Label Smoothing · Dense Connections · Dropout · Byte Pair Encoding · Layer Normalization
