Voice Reenactment with F0 and timing constraints and adversarial learning of conversions

Frederik Bous; Laurent Benaroya; Nicolas Obin; Axel Roebel

arXiv:2110.03744 · cs.SD · June 1, 2022

Open Access

TL;DR

This paper presents a neural voice conversion system that preserves source speech prosody and expressivity by incorporating F0 and timing constraints, enhanced with adversarial learning to improve naturalness and conversion quality.

Contribution

It introduces a novel sequence-to-sequence voice conversion architecture with explicit F0 and timing control, combined with adversarial training for better expressivity preservation.

Findings

01. Prosody is effectively preserved during conversion.

02. Adversarial learning improves naturalness and conversion quality.

03. The method outperforms baseline models on the VCTK dataset.

Abstract

This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural-VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values and an explicit F0-loss is formulated between the F0 of the source speaker and that of the converted speech. Besides, an adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit both advantages of reconstruction of original…
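The abstract mentions an explicit F0-loss between the source and converted speech but does not give its formula here. As a rough illustration only, one plausible form is a mean absolute difference of log-F0 over frames that are voiced in both signals. The function name `f0_loss`, the log-domain distance, and the convention that 0 Hz marks an unvoiced frame are all assumptions made for this sketch, not the authors' definition. It also assumes the two contours are frame-aligned, which is in keeping with the timing synchronization described above.

```python
import math

def f0_loss(f0_source, f0_converted):
    """Hypothetical F0 loss: mean |log F0_src - log F0_conv| over voiced frames.

    Assumptions (not from the paper): contours are in Hz, frame-aligned,
    and a value of 0 marks an unvoiced frame.
    """
    if len(f0_source) != len(f0_converted):
        raise ValueError("F0 contours must have the same number of frames")

    # Keep only frames where both signals are voiced, since log(0) is undefined.
    diffs = [
        abs(math.log(src) - math.log(conv))
        for src, conv in zip(f0_source, f0_converted)
        if src > 0 and conv > 0
    ]

    # With no jointly voiced frames there is nothing to penalize.
    if not diffs:
        return 0.0
    return sum(diffs) / len(diffs)
```

Working in the log domain makes the penalty depend on the pitch ratio rather than the difference in Hz, so an octave error costs the same for a low voice as for a high one. A trainable system would compute the same quantity on tensors with a differentiable F0 estimate.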

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing