VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech
Kun Zhou, Berrak Sisman, Haizhou Li

TL;DR
This paper introduces a VAW-GAN-based framework for emotional voice conversion that disentangles and recomposes the emotional elements of speech, changing the emotion of an utterance while preserving its linguistic content and speaker identity.
Contribution
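The paper proposes a speaker-dependent framework with two VAW-GAN pipelines, one for spectrum conversion and one for prosody conversion. The spectral encoder disentangles emotion and prosody (F0) information from spectral features, and the prosodic encoder separates affective prosody from linguistic prosody.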
Findings
Objective metrics indicate effective emotional voice conversion.
Subjective evaluations show improved emotional expressiveness.
The framework preserves linguistic content and speaker identity.
Abstract
Emotional voice conversion (EVC) aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity. In this paper, we study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN). We propose a speaker-dependent EVC framework based on VAW-GAN that includes two VAW-GAN pipelines, one for spectrum conversion and another for prosody conversion. We train a spectral encoder that disentangles emotion and prosody (F0) information from spectral features; we also train a prosodic encoder that disentangles emotion modulation of prosody (affective prosody) from linguistic prosody. At run-time, the decoder of spectral VAW-GAN is conditioned on the output of prosodic VAW-GAN. The vocoder takes the converted spectral and prosodic features to…
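The run-time data flow described in the abstract can be sketched roughly as follows. This is only an illustrative toy, not the authors' implementation: the trained networks are replaced by fixed random linear maps, and every name and dimension (`convert`, `linear`, `SP_DIM`, `F0_DIM`, the one-hot emotion code) is a hypothetical stand-in chosen for this sketch. The one thing it aims to show is the ordering: the prosodic pipeline runs first, and its converted output is passed as a condition to the spectral decoder.

```python
import random

random.seed(0)

# All sizes are illustrative assumptions, not values from the paper.
N_FRAMES, SP_DIM, F0_DIM, Z_DIM, N_EMOTIONS = 5, 8, 4, 3, 2


def linear(in_dim, out_dim):
    """Fixed random linear map standing in for a trained network (assumption)."""
    w = [[random.gauss(0.0, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]

    def apply(frames):
        return [
            [sum(x[i] * w[i][j] for i in range(in_dim)) for j in range(out_dim)]
            for x in frames
        ]

    return apply


def concat(*feature_sets):
    """Concatenate per-frame feature vectors along the feature axis."""
    return [sum(parts, []) for parts in zip(*feature_sets)]


def one_hot(emotion_id, n_frames):
    """Per-frame one-hot emotion code (a hypothetical conditioning scheme)."""
    code = [1.0 if k == emotion_id else 0.0 for k in range(N_EMOTIONS)]
    return [list(code) for _ in range(n_frames)]


# Stand-ins for the two pipelines: one encoder and one decoder each.
prosodic_encoder = linear(F0_DIM, Z_DIM)
prosodic_decoder = linear(Z_DIM + N_EMOTIONS, F0_DIM)
spectral_encoder = linear(SP_DIM, Z_DIM)
spectral_decoder = linear(Z_DIM + N_EMOTIONS + F0_DIM, SP_DIM)


def convert(sp, f0_feats, target_emotion):
    """Convert spectral and prosodic features toward a target emotion."""
    emo = one_hot(target_emotion, len(sp))
    # Prosody pipeline: encode source prosody, decode with the target emotion.
    z_pros = prosodic_encoder(f0_feats)
    f0_conv = prosodic_decoder(concat(z_pros, emo))
    # Spectrum pipeline: the decoder is conditioned on the target emotion
    # and on the converted prosody produced above.
    z_sp = spectral_encoder(sp)
    sp_conv = spectral_decoder(concat(z_sp, emo, f0_conv))
    return sp_conv, f0_conv


# Random placeholder features; real inputs would come from a speech analyzer.
sp = [[random.gauss(0.0, 1.0) for _ in range(SP_DIM)] for _ in range(N_FRAMES)]
f0_feats = [[random.gauss(0.0, 1.0) for _ in range(F0_DIM)] for _ in range(N_FRAMES)]

sp_conv, f0_conv = convert(sp, f0_feats, target_emotion=1)
print(len(sp_conv), len(sp_conv[0]), len(f0_conv[0]))
```

In a full system the two returned feature sets would be handed to a vocoder to synthesize the waveform; that step is left out here because the abstract gives no details on it.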
