StarGAN-VC++: Towards Emotion Preserving Voice Conversion Using Deep Embeddings
Arnab Das, Suhita Ghosh, Tim Polzehl, Sebastian Stober

TL;DR
This paper improves emotion preservation in voice conversion by introducing emotion-aware losses, addressing the failure of the GAN-based method StarGANv2-VC to carry the source speaker's emotion into the converted speech.
Contribution
It proposes novel emotion-aware loss functions and an unsupervised approach to disentangle speaker and emotion representations in voice conversion.
Findings
Improved emotion preservation in voice conversion
Effective disentanglement of speaker and emotion features
Robust performance across diverse datasets and emotions
Abstract
Voice conversion (VC) transforms an utterance to sound like another person without changing the linguistic content. A recently proposed generative adversarial network-based VC method, StarGANv2-VC, is very successful in generating natural-sounding conversions. However, the method fails to preserve the emotion of the source speaker in the converted samples. Emotion preservation is necessary for natural human-computer interaction. In this paper, we show that StarGANv2-VC fails to disentangle the speaker and emotion representations, which is pertinent to preserving emotion. Specifically, emotion leaks from the reference audio used to capture the speaker embeddings during training. To counter the problem, we propose novel emotion-aware losses and an unsupervised method that exploits emotion supervision through latent emotion representations. The objective and subjective evaluations prove…
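The abstract does not spell out the form of the emotion-aware losses, so the following is only an illustrative sketch of the general idea: penalise the distance between a latent emotion embedding of the source utterance and that of the converted utterance. The cosine distance, the function names, and the `weight` parameter are all assumptions made for illustration and are not taken from the paper.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("embeddings must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_u * norm_v)


def emotion_preservation_loss(source_emb, converted_emb):
    """Hypothetical loss: small when the converted utterance keeps the
    source emotion embedding, large when the two embeddings diverge."""
    return cosine_distance(source_emb, converted_emb)


def total_generator_loss(base_loss, source_emb, converted_emb, weight=1.0):
    """Add the weighted emotion term to an existing generator loss.
    `base_loss` stands in for whatever the VC model already optimises."""
    return base_loss + weight * emotion_preservation_loss(source_emb, converted_emb)
```

In this sketch the embeddings would come from some pretrained emotion encoder applied to the source and converted audio. Because the term compares the conversion only with the source, and never with the reference audio that supplies the speaker identity, it pushes against the kind of emotion leakage the abstract describes.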
Taxonomy
Topics: Speech Recognition and Synthesis · Voice and Speech Disorders · Speech and Audio Processing
