Voice Reenactment with F0 and timing constraints and adversarial learning of conversions
Frederik Bous, Laurent Benaroya, Nicolas Obin, Axel Roebel

TL;DR
This paper presents a neural voice conversion system that preserves source speech prosody and expressivity by incorporating F0 and timing constraints, enhanced with adversarial learning to improve naturalness and conversion quality.
Contribution
It introduces a novel sequence-to-sequence voice conversion architecture with explicit F0 and timing control, combined with adversarial training for better expressivity preservation.
Findings
Prosody is effectively preserved during conversion.
Adversarial learning improves naturalness and conversion quality.
The method outperforms baseline models on the VCTK dataset.
Abstract
This paper introduces voice reenactment as the task of voice conversion (VC) in which the expressivity of the source speaker is preserved during conversion while the identity of a target speaker is transferred. To do so, an original neural VC architecture is proposed based on sequence-to-sequence voice conversion (S2S-VC) in which the speech prosody of the source speaker is preserved during conversion. First, the S2S-VC architecture is modified so as to synchronize the converted speech with the source speech by means of phonetic duration encoding; second, the decoder is conditioned on the desired sequence of F0 values and an explicit F0 loss is formulated between the F0 of the source speaker and that of the converted speech. In addition, adversarial learning of conversions is integrated within the S2S-VC architecture so as to exploit both advantages of reconstruction of original…
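The abstract does not say how the F0 loss is defined. As a rough illustration only, the sketch below assumes one plausible form: the mean absolute difference of log-F0 over frames that are voiced in both signals. The function name `f0_loss`, the convention that 0.0 marks an unvoiced frame, and the choice of a log-domain L1 distance are all assumptions made for this example, not details taken from the paper. Comparing the two contours frame by frame is only meaningful because the converted speech is synchronized with the source.

```python
import math


def f0_loss(source_f0, converted_f0):
    """Mean absolute log-F0 difference over frames voiced in both signals.

    Assumptions (not from the paper): F0 is given in Hz per frame,
    0.0 marks an unvoiced frame, and both sequences are time-aligned.
    """
    if len(source_f0) != len(converted_f0):
        raise ValueError("F0 sequences must have the same number of frames")
    diffs = [
        abs(math.log(s) - math.log(c))
        for s, c in zip(source_f0, converted_f0)
        if s > 0.0 and c > 0.0  # skip frames unvoiced in either signal
    ]
    # With no jointly voiced frames there is nothing to penalise.
    return sum(diffs) / len(diffs) if diffs else 0.0
```

Working in the log domain makes the penalty depend on the pitch ratio rather than the difference in Hz, so an octave error costs the same for a low and a high voice.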
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
