Emotional Voice Conversion using Multitask Learning with Text-to-speech
Tae-Ho Kim, Sungjae Cho, Shinkook Choi, Sejik Park, Soo-Young Lee

TL;DR
This paper introduces a multitask learning approach combining voice conversion and text-to-speech models to better preserve linguistic content and emotional tone in voice transformation tasks.
Contribution
It proposes a novel multitask learning framework that leverages TTS embeddings to improve emotional voice conversion without explicit alignment.
Findings
Multitask learning improves linguistic content preservation in VC.
The model effectively captures emotional nuances in voice conversion.
Experimental results show enhanced stability and quality in VC.
Abstract
Voice conversion (VC) is the task of transforming a person's voice into a different style while preserving the linguistic content. The previous state of the art in VC is based on sequence-to-sequence (seq2seq) models, which can distort linguistic information. An earlier attempt to overcome this used textual supervision, but it requires explicit alignment, which loses the benefit of using a seq2seq model. In this paper, a voice converter using multitask learning with text-to-speech (TTS) is presented. The embedding space of seq2seq-based TTS contains abundant information about the text. The role of the TTS decoder is to convert the embedding space to speech, which is the same as in VC. In the proposed model, the whole network is trained to minimize the losses of both VC and TTS. Through multitask learning, VC is expected to capture more linguistic information and to maintain training stability. Experiments of VC were performed on a male…
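The joint objective described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only says that the network is trained to minimize the losses of VC and TTS. The L1 reconstruction loss, the `tts_weight` factor, the use of one shared `decoder` callable for both branches, and all function names here are assumptions made for illustration.

```python
from typing import Callable, Sequence

Frames = Sequence[Sequence[float]]  # e.g. a sequence of spectrogram frames


def l1_loss(pred: Frames, target: Frames) -> float:
    """Mean absolute error over all frame values (assumed loss, not from the paper)."""
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        for p, t in zip(p_frame, t_frame):
            total += abs(p - t)
            count += 1
    return total / count


def multitask_loss(
    decoder: Callable[[Frames], Frames],
    vc_embedding: Frames,   # hypothetical output of a speech (VC) encoder
    tts_embedding: Frames,  # hypothetical output of a text (TTS) encoder
    target: Frames,         # target speech features for both branches
    tts_weight: float = 1.0,  # assumed weighting; the abstract gives none
) -> float:
    """Sum of the VC and TTS losses, both computed through one shared decoder."""
    vc_loss = l1_loss(decoder(vc_embedding), target)
    tts_loss = l1_loss(decoder(tts_embedding), target)
    return vc_loss + tts_weight * tts_loss
```

In a real system the embeddings would come from trained encoders and the combined loss would be backpropagated through the whole network; the sketch only shows how the two task losses combine into a single training objective.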
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Sigmoid Activation · Tanh Activation · Long Short-Term Memory · Sequence to Sequence
