Beyond Voice Identity Conversion: Manipulating Voice Attributes by Adversarial Learning of Structured Disentangled Representations
Laurent Benaroya, Nicolas Obin, Axel Roebel

TL;DR
This paper introduces a neural architecture that manipulates voice attributes such as gender and age by disentangling speech representations, enabling realistic voice conversion that goes beyond identity modification and stays time-synchronized with the source speech.
Contribution
It proposes a novel adversarial structured neural network with multiple auto-encoders for independent speech attribute manipulation, preserving timing for lip-sync applications.
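The idea of manipulating one attribute while keeping timing intact can be illustrated with a toy sketch. This is not the authors' network: random linear maps stand in for trained encoders and a decoder, the adversarial training that would make the codes independent is omitted entirely, and every name and size here (`N_MEL`, `D_CONTENT`, `D_ATTR`, `encode_content`, `encode_attribute`, `decode`, `convert`) is a hypothetical chosen for illustration. It only shows why a per-frame content code combined with an utterance-level attribute code yields an output with the same number of frames as the source.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
N_MEL, D_CONTENT, D_ATTR = 80, 16, 4

# Randomly initialised linear maps stand in for trained networks.
W_content = rng.standard_normal((N_MEL, D_CONTENT)) * 0.1
W_attr = rng.standard_normal((N_MEL, D_ATTR)) * 0.1
W_dec = rng.standard_normal((D_CONTENT + D_ATTR, N_MEL)) * 0.1


def encode_content(mel):
    """Per-frame code: one vector per input frame, so timing is kept."""
    return np.tanh(mel @ W_content)                # (T, D_CONTENT)


def encode_attribute(mel):
    """Utterance-level code: average over time, then project."""
    return np.tanh(mel.mean(axis=0) @ W_attr)      # (D_ATTR,)


def decode(content, attr):
    """Copy the attribute code onto every frame, then decode."""
    tiled = np.tile(attr, (content.shape[0], 1))   # (T, D_ATTR)
    return np.concatenate([content, tiled], axis=1) @ W_dec  # (T, N_MEL)


def convert(source_mel, target_attr):
    """Keep the source's per-frame content, swap in another attribute code."""
    return decode(encode_content(source_mel), target_attr)


# Fake mel-like features; the reference has a different length on purpose.
source = rng.standard_normal((120, N_MEL))
reference = rng.standard_normal((75, N_MEL))
converted = convert(source, encode_attribute(reference))
```

Because the attribute code carries no time axis, the converted features always have as many frames as the source, whatever the length of the reference. In a real system the encoders would additionally be trained so that the content code does not leak the attribute, which this sketch does not attempt.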
Findings
Achieves high-quality voice gender conversion on the VCTK dataset.
Successfully learns gender-independent speech representations.
Preserves original speech timing for lip-sync applications.
Abstract
Voice conversion (VC) consists of digitally altering the voice of an individual to manipulate part of its content, primarily its identity, while maintaining the rest unchanged. Research in neural VC has accomplished considerable breakthroughs with the capacity to falsify a voice identity using a small amount of data with a highly realistic rendering. This paper goes beyond voice identity and presents a neural architecture that allows the manipulation of voice attributes (e.g., gender and age). Leveraging the latest advances in adversarial learning of structured speech representations, a novel structured neural network is proposed in which multiple auto-encoders are used to encode speech as a set of ideally independent linguistic and extra-linguistic representations, which are learned adversarially and can be manipulated during VC. Moreover, the proposed architecture is…
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Voice and Speech Disorders
