Voice Conversion using Convolutional Neural Networks
Shariq Mobin, Joan Bruna

TL;DR
This paper explores voice conversion by transforming pitch and timbre using convolutional neural networks, aiming to better mimic individual speaker identities.
Contribution
It introduces a neural network-based approach to manipulate both pitch and timbre for voice conversion, advancing previous methods that focused mainly on pitch.
Findings
Preliminary results show promising voice conversion quality.
Neural networks can effectively learn speaker-specific features.
The approach improves speaker similarity in converted voices.
Abstract
The human auditory system is able to distinguish the vocal source of thousands of speakers, yet not much is known about what features the auditory system uses to do this. Fourier Transforms are capable of capturing the pitch and harmonic structure of the speaker, but this alone proves insufficient for identifying speakers uniquely. The remaining structure, often referred to as timbre, is critical to identifying speakers, but we understand little about it. In this paper we use recent advances in neural networks to manipulate the voice of one speaker into another by transforming not only the pitch of the speaker, but also the timbre. We review generative models built with neural networks as well as architectures for creating neural networks that learn analogies. Our preliminary results converting voices from one speaker to another are encouraging.
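As a rough illustration of the abstract's remark that Fourier Transforms capture pitch and harmonic structure, the sketch below synthesizes a toy harmonic signal and recovers its fundamental and overtones from a discrete Fourier transform. It is not taken from the paper: the signal, the sample rate, the window length, and the helper names (`synthesize`, `magnitude_spectrum`, `strongest_bins`) are all assumptions chosen so that each harmonic falls exactly on a frequency bin.

```python
import cmath
import math

SAMPLE_RATE = 8000   # Hz (assumed for this toy example)
N_SAMPLES = 400      # gives 8000 / 400 = 20 Hz per DFT bin
BIN_HZ = SAMPLE_RATE / N_SAMPLES

def synthesize(f0, harmonic_amps):
    """Sum of sinusoids at integer multiples of f0 (a toy 'voiced' sound)."""
    return [
        sum(a * math.sin(2 * math.pi * f0 * (k + 1) * t / SAMPLE_RATE)
            for k, a in enumerate(harmonic_amps))
        for t in range(N_SAMPLES)
    ]

def magnitude_spectrum(signal):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2.

    Scaled by 2/N so an interior bin reads as the sinusoid's amplitude.
    """
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * b * t / n)
                for t, x in enumerate(signal))) * 2 / n
        for b in range(n // 2 + 1)
    ]

def strongest_bins(spectrum, count):
    """Indices of the `count` largest bins, returned in frequency order."""
    ranked = sorted(range(len(spectrum)), key=lambda b: -spectrum[b])
    return sorted(ranked[:count])

# Fundamental at 200 Hz with two weaker overtones at 400 Hz and 600 Hz.
signal = synthesize(200, [1.0, 0.5, 0.25])
spectrum = magnitude_spectrum(signal)

peak_bins = strongest_bins(spectrum, 3)
peak_freqs = [b * BIN_HZ for b in peak_bins]
peak_amps = [round(spectrum[b], 6) for b in peak_bins]

print(peak_freqs)  # frequencies of the three strongest components, in Hz
print(peak_amps)   # their estimated amplitudes
```

The lowest peak corresponds to the pitch and the remaining peaks to the harmonic structure. Real speech is not stationary or perfectly periodic, so in practice a short-time transform over overlapping windows would be used instead of a single DFT over the whole signal.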
Taxonomy
Topics: Music and Audio Processing · Speech and Audio Processing · Music Technology and Sound Studies
