On Prosody Modeling for ASR+TTS based Voice Conversion

Wen-Chin Huang; Tomoki Hayashi; Xinjian Li; Shinji Watanabe; Tomoki Toda

arXiv:2107.09477·cs.SD·July 21, 2021·1 cites

Open Access

TL;DR

This paper improves voice conversion by directly predicting target speaker prosody from linguistic features, enhancing naturalness and similarity in ASR+TTS based systems.

Contribution

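It introduces target text prediction (TTP), which predicts prosody from the linguistic representation in a target-speaker-dependent manner, avoiding the speaker mismatch between training and conversion that arises when prosody is transferred from the source speech.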

Findings

01. TTP outperforms source prosody transfer in evaluations

02. Effective across different linguistic representations

03. Enhances speech naturalness and conversion similarity

Abstract

In voice conversion (VC), an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents; these are then used as input by a text-to-speech (TTS) system to generate the converted speech. Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity. Although some researchers have considered transferring prosodic clues from the source speech, there arises a speaker mismatch during training and conversion. To address this issue, in this work, we propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP). We evaluate both methods on the VCC2020…
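The two prosody strategies named in the abstract can be contrasted with a toy sketch. Everything here is an assumption for illustration: `toy_asr`, `toy_tts`, the per-token float used as "prosody", and the constant predictor are hypothetical stand-ins, not the authors' models or any library's API. The only point shown is the data flow: source prosody transfer reads prosody from the source speech, while TTP predicts it from the linguistic tokens alone.

```python
from typing import Callable, List, Sequence

Tokens = List[str]
Prosody = List[float]  # toy stand-in: one prosody value per token


def toy_asr(speech: Sequence[float]) -> Tokens:
    """Hypothetical ASR stub: one placeholder token per input frame."""
    return [f"tok{i}" for i, _ in enumerate(speech)]


def toy_tts(tokens: Tokens, prosody: Prosody) -> List[float]:
    """Hypothetical target-speaker TTS stub: output depends on tokens and prosody."""
    if len(tokens) != len(prosody):
        raise ValueError("need one prosody value per token")
    return [float(len(t)) + p for t, p in zip(tokens, prosody)]


def source_prosody_transfer(speech: Sequence[float], tokens: Tokens) -> Prosody:
    """Prosody taken from the source speech (here simply the frame values)."""
    return [float(x) for x in speech]


def target_text_prediction(speech: Sequence[float], tokens: Tokens) -> Prosody:
    """Prosody predicted from tokens only; the source speech is ignored.

    The constant 0.5 is an assumed placeholder for a target-speaker-dependent
    predictor.
    """
    return [0.5 for _ in tokens]


def convert(
    speech: Sequence[float],
    prosody_fn: Callable[[Sequence[float], Tokens], Prosody],
) -> List[float]:
    """ASR+TTS conversion: recognize tokens, obtain prosody, then synthesize."""
    tokens = toy_asr(speech)
    prosody = prosody_fn(speech, tokens)
    return toy_tts(tokens, prosody)


utterance_a = [0.1, 0.9]
utterance_b = [0.7, 0.2]

# With TTP, the converted output does not depend on the source prosody values.
same_under_ttp = convert(utterance_a, target_text_prediction) == convert(
    utterance_b, target_text_prediction
)
# With source prosody transfer, it does.
same_under_transfer = convert(utterance_a, source_prosody_transfer) == convert(
    utterance_b, source_prosody_transfer
)
```

In this toy setting the two utterances yield identical tokens, so they produce the same output under TTP but different outputs under source prosody transfer, which is the dependence on the source speaker that the abstract describes as a mismatch.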

Taxonomy

Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Voice and Speech Disorders