End-to-End Voice Conversion with Information Perturbation
Qicong Xie, Shan Yang, Yi Lei, Lei Xie, Dan Su

TL;DR
This paper introduces an end-to-end voice conversion method that uses information perturbation and specialized encoders to improve naturalness, speaker similarity, and prosody transfer in converted speech.
Contribution
The paper proposes an end-to-end voice conversion framework that combines information perturbation with a speaker-related pitch encoder to produce high-quality converted speech.
Findings
The proposed model outperforms state-of-the-art voice conversion models in naturalness and speaker similarity.
It transfers the source prosody effectively while maintaining the target speaker's timbre.
It also improves the intelligibility and quality of the converted speech.
Abstract
The ideal goal of voice conversion is to make the source speaker's speech sound naturally like the target speaker while maintaining the linguistic content and the prosody of the source speech. However, current approaches fall short of comprehensive source prosody transfer and target speaker timbre preservation in the converted speech, and the quality of the converted speech is also unsatisfactory because of the mismatch between the acoustic model and the vocoder. In this paper, we leverage recent advances in information perturbation and propose a fully end-to-end approach to high-quality voice conversion. We first adopt information perturbation to remove speaker-related information from the source speech, disentangling speaker timbre from linguistic content, so that the linguistic information can subsequently be modeled by a content encoder. To better transfer the…
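The idea of perturbing a waveform so that speaker-dependent cues become unreliable can be illustrated with a minimal sketch. The abstract does not say which perturbation functions the authors use, so the random-resampling function below (`perturb_speaker_cues`, with its `max_shift` parameter) is a hypothetical stand-in, not the paper's method. It scales pitch and formants together and also changes duration, which a real system that must preserve source prosody would likely avoid.

```python
import numpy as np

def perturb_speaker_cues(wave, rng, max_shift=0.2):
    """Hypothetical stand-in perturbation: resample by a random factor.

    Played back at the original sample rate, the output has all of its
    frequencies (pitch and formants) scaled by roughly `factor`.
    """
    factor = 1.0 + rng.uniform(-max_shift, max_shift)
    n_out = max(1, int(round(len(wave) / factor)))
    # Fractional positions in the input at which each output sample is read.
    src_pos = np.linspace(0.0, len(wave) - 1, num=n_out)
    perturbed = np.interp(src_pos, np.arange(len(wave)), wave)
    return perturbed, factor

# Assumed toy input: a 1-second 220 Hz tone standing in for source speech.
rng = np.random.default_rng(0)
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 220.0 * t)

perturbed, factor = perturb_speaker_cues(wave, rng)
print(f"scale factor: {factor:.3f}, samples: {len(wave)} -> {len(perturbed)}")
```

Because the scale factor is drawn anew for each utterance, a content encoder trained on such inputs cannot rely on absolute pitch or formant positions, which is the intuition behind using perturbation to separate linguistic content from speaker timbre.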
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
