On Prosody Modeling for ASR+TTS based Voice Conversion
Wen-Chin Huang, Tomoki Hayashi, Xinjian Li, Shinji Watanabe, Tomoki Toda

TL;DR
This paper improves ASR+TTS based voice conversion by predicting target-speaker prosody directly from the linguistic representation, with the aim of better naturalness and conversion similarity.
Contribution
It introduces target text prediction (TTP) for prosody modeling, which addresses the speaker mismatch between training and conversion that arises when prosody is transferred from the source speech.
Findings
TTP outperforms source prosody transfer in evaluations
Effective across different linguistic representations
Enhances speech naturalness and conversion similarity
Abstract
In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. Although some researchers have considered transferring prosodic clues from the source speech, there arises a speaker mismatch during training and conversion. To address this issue, in this work, we propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP). We evaluate both methods on the VCC2020…
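The difference between the two prosody strategies named in the abstract can be illustrated with a toy sketch. This is not the authors' model: every function here (`toy_asr`, `source_prosody_transfer`, `target_text_prediction`, `toy_tts`, `convert`) and the `speaker_bias` parameter are hypothetical stand-ins, and the arithmetic inside them is arbitrary. The only point is where the prosody vector comes from: the source speech in one case, the linguistic tokens plus a target-speaker-specific predictor in the other.

```python
from typing import Dict, List, Sequence


def toy_asr(source_speech: Sequence[float]) -> List[str]:
    """Hypothetical ASR stand-in: always returns the same token sequence."""
    return ["h", "e", "l", "l", "o"]


def source_prosody_transfer(source_speech: Sequence[float]) -> List[float]:
    """Toy prosody vector computed from the SOURCE speech (mean and range)."""
    mean = sum(source_speech) / len(source_speech)
    return [mean, max(source_speech) - min(source_speech)]


def target_text_prediction(tokens: List[str], speaker_bias: float) -> List[float]:
    """Toy prosody vector predicted from the tokens only.

    `speaker_bias` is an assumed placeholder for whatever makes the
    predictor target-speaker-dependent.
    """
    return [len(tokens) * speaker_bias, speaker_bias]


def toy_tts(tokens: List[str], prosody: List[float]) -> Dict[str, list]:
    """Hypothetical TTS stand-in: returns its inputs instead of audio."""
    return {"tokens": tokens, "prosody": prosody}


def convert(source_speech: Sequence[float], use_ttp: bool,
            speaker_bias: float = 0.5) -> Dict[str, list]:
    """ASR+TTS conversion with one of the two prosody strategies."""
    tokens = toy_asr(source_speech)
    if use_ttp:
        # Prosody depends on the linguistic content and the target speaker.
        prosody = target_text_prediction(tokens, speaker_bias)
    else:
        # Prosody depends on the acoustics of the source speaker's utterance.
        prosody = source_prosody_transfer(source_speech)
    return toy_tts(tokens, prosody)
```

In this sketch, two acoustically different source utterances with the same transcript yield the same prosody vector under `use_ttp=True` and different vectors under `use_ttp=False`, which is the dependency structure the abstract describes.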
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Voice and Speech Disorders
