Disentanglement of Emotional Style and Speaker Identity for Expressive Voice Conversion

Zongyang Du; Berrak Sisman; Kun Zhou; Haizhou Li

arXiv:2110.10326·eess.AS·July 22, 2022

Open Access

TL;DR

This paper introduces StyleVC, a voice conversion framework that disentangles linguistic content, speaker identity, pitch, and emotional style, so that both speaker identity and emotional style can be converted for arbitrary speakers. Objective and subjective evaluations support its effectiveness.

Contribution

The paper presents StyleVC, a new VAE-based framework that disentangles multiple speech attributes for expressive voice conversion. It addresses the difficulty that the hierarchical structure of speech emotion poses for disentangling emotional style across speakers.

Findings

01. Effective disentanglement of emotional style and speaker identity.

02. Successful conversion of speaker identity and emotional style for arbitrary speakers.

03. Validated improvements in objective and subjective evaluations.

Abstract

Expressive voice conversion performs identity conversion for emotional speakers by jointly converting speaker identity and emotional style. Due to the hierarchical structure of speech emotion, it is challenging to disentangle the emotional style for different speakers. Inspired by the recent success of speaker disentanglement with variational autoencoders (VAEs), we propose an any-to-any expressive voice conversion framework called StyleVC. StyleVC is designed to disentangle linguistic content, speaker identity, pitch, and emotional style information. We study the use of a style encoder to model emotional style explicitly. At run-time, StyleVC converts both speaker identity and emotional style for arbitrary speakers. Experiments validate the effectiveness of our proposed framework in both objective and subjective evaluations.
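
The run-time data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyStyleVC`, its method names, and all dimensions are hypothetical, and fixed random linear projections stand in for the trained VAE-based encoders and decoder. The abstract does not say how pitch is handled at conversion time; passing the source pitch contour through unchanged is an assumption made here only to keep the example small.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Frame = List[float]
Frames = List[Frame]  # T frames of acoustic features (e.g. a mel-spectrogram)


def make_projection(in_dim: int, out_dim: int, seed: int) -> Callable[[Frame], Frame]:
    """Return a fixed random linear map; a stand-in for a trained network."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def project(frame: Frame) -> Frame:
        return [sum(w * x for w, x in zip(row, frame)) for row in weights]

    return project


def mean_pool(frames: Frames) -> Frame:
    """Average over time to get one utterance-level vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


@dataclass
class ToyStyleVC:
    # All sizes are illustrative assumptions, not values from the paper.
    feat_dim: int = 8
    content_dim: int = 4
    speaker_dim: int = 3
    style_dim: int = 2

    def __post_init__(self) -> None:
        self._content = make_projection(self.feat_dim, self.content_dim, seed=0)
        self._speaker = make_projection(self.feat_dim, self.speaker_dim, seed=1)
        self._style = make_projection(self.feat_dim, self.style_dim, seed=2)
        decoder_in = self.content_dim + 1 + self.speaker_dim + self.style_dim
        self._decoder = make_projection(decoder_in, self.feat_dim, seed=3)

    def encode_content(self, frames: Frames) -> Frames:
        # Frame-level representation: what is being said.
        return [self._content(f) for f in frames]

    def encode_speaker(self, frames: Frames) -> Frame:
        # Utterance-level representation: who is speaking.
        return mean_pool([self._speaker(f) for f in frames])

    def encode_style(self, frames: Frames) -> Frame:
        # Utterance-level representation: emotional style.
        return mean_pool([self._style(f) for f in frames])

    def convert(self, source: Frames, source_f0: List[float], reference: Frames) -> Frames:
        """Keep content (and, by assumption, pitch) from the source;
        take speaker identity and emotional style from the reference."""
        content = self.encode_content(source)
        speaker = self.encode_speaker(reference)
        style = self.encode_style(reference)
        return [
            self._decoder(c + [f0] + speaker + style)
            for c, f0 in zip(content, source_f0)
        ]
```

In this sketch, `convert(source, source_f0, reference)` returns one output frame per source frame, so the linguistic content and timing follow the source utterance while the speaker and style vectors come entirely from the reference utterance. Swapping in a different reference changes the output without changing its length.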

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing