Voice Conversion from Non-parallel Corpora Using Variational Auto-encoder
Chin-Cheng Hsu, Hsin-Te Hwang, Yi-Chiao Wu, Yu Tsao, Hsin-Min Wang

TL;DR
This paper introduces a variational auto-encoder (VAE) based spectral conversion framework that can be trained on non-parallel corpora, removing the requirement of earlier methods for aligned data.
Contribution
It presents a novel VAE-based spectral conversion approach that eliminates the need for parallel corpora or phonetic alignments, broadening practical applications.
Findings
Achieves comparable spectral conversion quality without aligned data.
Demonstrates effectiveness through objective and subjective evaluations.
Is evaluated against traditional methods that require parallel corpora.
Abstract
We propose a flexible framework for spectral conversion (SC) that facilitates training with unaligned corpora. Many SC frameworks require parallel corpora, phonetic alignments, or explicit frame-wise correspondence for learning conversion functions or for synthesizing a target spectrum with the aid of alignments. However, these requirements gravely limit the scope of practical applications of SC due to scarcity or even unavailability of parallel corpora. We propose an SC framework based on variational auto-encoder which enables us to exploit non-parallel corpora. The framework comprises an encoder that learns speaker-independent phonetic representations and a decoder that learns to reconstruct the designated speaker. It removes the requirement of parallel corpora or phonetic alignments to train a spectral conversion system. We report objective and subjective evaluations to validate our…
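The encoder/decoder split described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the layer sizes, the single linear layers, the one-hot speaker code, the squared-error reconstruction term, and the choice to convert from the latent mean are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
FRAME_DIM = 8     # dimension of one spectral frame
LATENT_DIM = 4    # dimension of the latent (phonetic) code
N_SPEAKERS = 2    # number of speakers in the non-parallel corpus

# Untrained, randomly initialised linear layers (assumption: a real
# system would use deeper networks and learn these weights).
W_mu = rng.normal(scale=0.1, size=(FRAME_DIM, LATENT_DIM))
W_logvar = rng.normal(scale=0.1, size=(FRAME_DIM, LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM + N_SPEAKERS, FRAME_DIM))


def encode(x):
    """Map frames to the mean and log-variance of the latent code."""
    return x @ W_mu, x @ W_logvar


def sample_latent(mu, logvar, rng):
    """Reparameterisation trick: z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z, speaker_id):
    """Reconstruct frames from the latent code plus a one-hot speaker code."""
    y = np.zeros((z.shape[0], N_SPEAKERS))
    y[:, speaker_id] = 1.0
    return np.concatenate([z, y], axis=1) @ W_dec


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, sigma^2) || N(0, I) ) per frame."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1)


def training_loss(x, speaker_id, rng):
    """Negative ELBO for frames of one speaker; no alignment is needed
    because each frame is only ever reconstructed as itself."""
    mu, logvar = encode(x)
    z = sample_latent(mu, logvar, rng)
    recon_err = 0.5 * np.sum((x - decode(z, speaker_id)) ** 2, axis=1)
    return float(np.mean(recon_err + kl_to_standard_normal(mu, logvar)))


def convert(x, target_speaker_id):
    """Conversion: encode the source frames, decode as the target speaker."""
    mu, _ = encode(x)
    return decode(mu, target_speaker_id)


# Usage: convert five source frames to speaker 1.
frames = rng.standard_normal((5, FRAME_DIM))
converted = convert(frames, target_speaker_id=1)
loss = training_loss(frames, speaker_id=0, rng=rng)
```

The point of the sketch is that training only reconstructs each speaker's own frames, so no frame-wise pairing between speakers is ever required; conversion amounts to swapping the speaker code at decoding time.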
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
