Voice Conversion with Conditional SampleRNN
Cong Zhou, Michael Horgan, Vivek Kumar, Cristina Vasco, Dan Darcy

TL;DR
This paper introduces a novel voice conversion method using a conditioned SampleRNN model that preserves speech content while changing speaker identity, enabling flexible, many-to-many voice conversion without parallel data.
Contribution
The paper presents a new way of conditioning SampleRNN for voice conversion. In subjective evaluation it outperforms conventional VC methods, and it does not require parallel data.
Findings
Outperforms conventional VC methods in subjective evaluations
Enables many-to-many voice conversion without parallel data
Preserves speech content while changing speaker identity
Abstract
Here we present a novel approach to conditioning the SampleRNN generative model for voice conversion (VC). Conventional methods for VC modify the perceived speaker identity by converting between source and target acoustic features. Our approach focuses on preserving voice content and depends on the generative network to learn voice style. We first train a multi-speaker SampleRNN model conditioned on linguistic features, pitch contour, and speaker identity using a multi-speaker speech corpus. Voice-converted speech is generated using linguistic features and pitch contour extracted from the source speaker, and the target speaker identity. We demonstrate that our system is capable of many-to-many voice conversion without requiring parallel data, enabling broad applications. Subjective evaluation demonstrates that our approach outperforms conventional VC methods.
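The conversion step described in the abstract can be sketched as the assembly of a per-frame conditioning sequence: linguistic features and pitch come from the source utterance, while the speaker identity is swapped for the target. The sketch below is only an illustration under stated assumptions. The function name `build_conditioning`, the one-hot speaker encoding, and the simple concatenation are hypothetical choices, not details confirmed by the paper, and the feature extractors and the SampleRNN model itself are out of scope.

```python
from typing import List, Sequence


def build_conditioning(
    linguistic: Sequence[Sequence[float]],
    pitch: Sequence[float],
    target_speaker: int,
    num_speakers: int,
) -> List[List[float]]:
    """Build one conditioning vector per frame for a conversion pass.

    linguistic:     per-frame content features from the SOURCE utterance
    pitch:          per-frame pitch contour from the SOURCE utterance
    target_speaker: integer ID of the TARGET speaker
    num_speakers:   number of speakers seen in training

    Assumption: speaker identity is a one-hot vector repeated on every
    frame and concatenated with the other features.
    """
    if len(linguistic) != len(pitch):
        raise ValueError("linguistic features and pitch must have the same number of frames")
    if not 0 <= target_speaker < num_speakers:
        raise ValueError("target_speaker must be in [0, num_speakers)")

    # Same speaker code on every frame; only this part changes for conversion.
    one_hot = [1.0 if i == target_speaker else 0.0 for i in range(num_speakers)]

    return [
        [float(x) for x in feats] + [float(f0)] + one_hot
        for feats, f0 in zip(linguistic, pitch)
    ]


# Two frames of toy source features, converted toward speaker 1 of 3.
cond = build_conditioning([[0.1, 0.2], [0.3, 0.4]], [110.0, 115.0], 1, 3)
print(len(cond), len(cond[0]))
```

Under this sketch, many-to-many conversion amounts to changing `target_speaker` while leaving the source-derived features untouched, which is why no parallel recordings of source and target speakers are needed at conversion time.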
