Emotional Voice Conversion using Multitask Learning with Text-to-speech

Tae-Ho Kim; Sungjae Cho; Shinkook Choi; Sejik Park; Soo-Young Lee

arXiv:1911.06149 · eess.AS · November 28, 2019 · 1 citation

TL;DR

This paper introduces a multitask learning approach combining voice conversion and text-to-speech models to better preserve linguistic content and emotional tone in voice transformation tasks.

Contribution

It proposes a novel multitask learning framework that leverages TTS embeddings to improve emotional voice conversion without explicit alignment.

Findings

01 Multitask learning improves linguistic content preservation in VC.

02 The model effectively captures emotional nuances in voice conversion.

03 Experimental results show enhanced stability and quality in VC.

Abstract

Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving linguistic content. The previous state of the art in VC is based on sequence-to-sequence (seq2seq) models, which can distort linguistic information. An earlier attempt to overcome this used textual supervision, but it requires explicit alignment, which forfeits the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS carries abundant information about the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same role it plays in VC. In the proposed model, the whole network is trained to minimize the losses of both VC and TTS. Through multitask learning, VC is expected to capture more linguistic information and to maintain training stability. Experiments of VC were performed on a male…
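The joint objective described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the linear "encoders", the L1 reconstruction loss, the `tts_weight` factor, and the assumption that both predictions have the same length as the target are all hypothetical simplifications. It only shows the structural idea that a speech encoder and a text encoder feed one shared decoder, and that a single scalar loss sums the VC and TTS terms.

```python
# Toy illustration of multitask training with a shared decoder.
# All names and numeric choices here are hypothetical, not from the paper.

def speech_encoder(frames):
    # Stand-in for the VC encoder: source speech frames -> embeddings.
    return [0.5 * x for x in frames]

def text_encoder(token_ids):
    # Stand-in for the TTS encoder: text tokens -> embeddings.
    return [0.1 * t for t in token_ids]

def shared_decoder(embeddings, weight):
    # One decoder, parameterized by a single scalar weight,
    # turns embeddings from either encoder into output frames.
    return [weight * e for e in embeddings]

def l1_loss(pred, target):
    # Mean absolute error (an assumed choice of reconstruction loss).
    if len(pred) != len(target):
        raise ValueError("toy sketch assumes equal-length sequences")
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def multitask_loss(src_frames, token_ids, target_frames, weight, tts_weight=1.0):
    # Both tasks are decoded by the same decoder and scored against
    # the same target speech; the total is what training would minimize.
    vc_pred = shared_decoder(speech_encoder(src_frames), weight)
    tts_pred = shared_decoder(text_encoder(token_ids), weight)
    vc_loss = l1_loss(vc_pred, target_frames)
    tts_loss = l1_loss(tts_pred, target_frames)
    total = vc_loss + tts_weight * tts_loss
    return total, vc_loss, tts_loss

if __name__ == "__main__":
    total, vc, tts = multitask_loss(
        src_frames=[2.0, 4.0],
        token_ids=[10, 20],
        target_frames=[1.0, 2.0],
        weight=2.0,
    )
    print(total, vc, tts)
```

Because the decoder weight appears in both loss terms, a gradient step on the total would be driven by the text-conditioned task as well as the speech-conditioned one, which is the mechanism the abstract credits for better linguistic preservation.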

Code & Models

Repositories

ktho22/vctts
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence