A Vocoder-free WaveNet Voice Conversion with Non-Parallel Data
Xiaohai Tian, Eng Siong Chng, Haizhou Li

TL;DR
This paper introduces a vocoder-free, WaveNet-based voice conversion method that maps Phonetic PosteriorGrams (PPGs) directly to speech waveform samples. The method improves speech quality and avoids vocoder-induced estimation errors when only non-parallel training data is available.
Contribution
It proposes a vocoder-free approach that uses WaveNet to convert PPGs directly to speech. This avoids the quality degradation introduced by a vocoder and reduces the feature mismatch problem of WaveNet-vocoder-based approaches.
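A PPG is a frame-by-class matrix: for each analysis frame, a speaker-independent speech recognition acoustic model outputs a posterior probability for every phonetic class. The minimal NumPy sketch below shows only that final step. It assumes the acoustic model's frame-level logits are already available, and the frame count and number of phonetic classes are hypothetical values rather than the paper's settings.

```python
import numpy as np

def phonetic_posteriorgram(logits: np.ndarray) -> np.ndarray:
    """Convert frame-level acoustic-model logits into a PPG.

    logits: (num_frames, num_phonetic_classes) scores from a speaker-independent
    acoustic model (assumed to exist; not part of this sketch).
    Returns per-frame posterior probabilities; each row sums to 1.
    """
    # Subtract the row-wise maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical sizes: 200 frames and 42 phonetic classes, with random stand-in logits.
rng = np.random.default_rng(0)
ppg = phonetic_posteriorgram(rng.normal(size=(200, 42)))
print(ppg.shape)                           # (200, 42)
print(np.allclose(ppg.sum(axis=1), 1.0))   # True
```

Because each row is a probability distribution over phonetic classes, not a spectral description of the voice, the representation is assumed to carry little speaker identity. This assumption underlies the method's use of PPGs.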
Findings
Significantly better speech quality than the baseline approaches on the CMU-ARCTIC database
Effective non-parallel training with PPGs
Avoids estimation errors caused by the vocoder and by feature conversion
Abstract
In a typical voice conversion system, a vocoder is commonly used for speech-to-features analysis and features-to-speech synthesis. However, the vocoder can be a source of speech quality degradation. This paper presents a vocoder-free voice conversion approach using WaveNet for non-parallel training data. Instead of dealing with intermediate features, the proposed approach uses WaveNet to map Phonetic PosteriorGrams (PPGs) directly to waveform samples. In this way, we avoid the estimation errors caused by the vocoder and by feature conversion. Additionally, as PPGs are assumed to be speaker independent, the proposed method also reduces the feature mismatch problem in WaveNet-vocoder-based approaches. Experimental results on the CMU-ARCTIC database show that the proposed approach significantly outperforms the baseline approaches in terms of speech quality.
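To make "WaveNet maps PPGs to waveform samples" concrete, here is a minimal NumPy sketch of a single WaveNet residual layer with local conditioning. It follows the gated activation of the original WaveNet formulation, z = tanh(W_f * x + V_f c) ⊙ σ(W_g * x + V_g c), where x is the past-sample signal and c is the PPG sequence. Several details are assumptions, not taken from this paper: PPG frames are upsampled to the sample rate by simple repetition, the hop size of 80 samples and all layer sizes are hypothetical, and the weights are random. The sketch therefore shows only the data flow, not a trained model or the authors' configuration.

```python
import numpy as np

def upsample_frames(ppg: np.ndarray, hop: int) -> np.ndarray:
    """Repeat each PPG frame `hop` times so conditioning aligns with samples (assumed scheme)."""
    return np.repeat(ppg, hop, axis=0)  # (frames * hop, classes)

def causal_dilated_conv(x: np.ndarray, w: np.ndarray, dilation: int) -> np.ndarray:
    """Kernel-size-2 causal dilated convolution.

    x: (T, C_in); w: (2, C_in, C_out). Output at time t depends only on
    x[t] and x[t - dilation] (zeros before the start), never on future samples.
    """
    padded = np.concatenate([np.zeros((dilation, x.shape[1])), x], axis=0)
    past = padded[:-dilation]  # row t holds x[t - dilation]
    return past @ w[0] + x @ w[1]

def gated_residual_layer(x, cond, params, dilation):
    """One residual layer: gated activation conditioned on PPGs, 1x1 output projection."""
    w_f, w_g, v_f, v_g, w_out = params
    filt = causal_dilated_conv(x, w_f, dilation) + cond @ v_f
    gate = causal_dilated_conv(x, w_g, dilation) + cond @ v_g
    z = np.tanh(filt) * (1.0 / (1.0 + np.exp(-gate)))  # tanh filter times sigmoid gate
    return x + z @ w_out, z  # (residual output, skip contribution)

# Hypothetical sizes and random stand-in data.
rng = np.random.default_rng(0)
hop, frames, classes, channels = 80, 10, 42, 16
ppg = rng.dirichlet(np.ones(classes), size=frames)   # stand-in PPG; rows sum to 1
cond = upsample_frames(ppg, hop)                     # (800, 42)
x = rng.normal(size=(frames * hop, channels))        # stand-in embedded past samples
params = (
    rng.normal(scale=0.1, size=(2, channels, channels)),  # W_f
    rng.normal(scale=0.1, size=(2, channels, channels)),  # W_g
    rng.normal(scale=0.1, size=(classes, channels)),      # V_f
    rng.normal(scale=0.1, size=(classes, channels)),      # V_g
    rng.normal(scale=0.1, size=(channels, channels)),     # 1x1 output projection
)
out, skip = gated_residual_layer(x, cond, params, dilation=4)
print(out.shape, skip.shape)  # (800, 16) (800, 16)
```

A full model stacks many such layers with growing dilations and predicts a distribution over the next waveform sample from the summed skip outputs. In the usual PPG-based setup, which the abstract implies but does not spell out, such a network is trained on a target speaker's own PPG and waveform pairs, so no parallel data is needed. At conversion time, PPGs extracted from the source speaker's speech are fed in as `cond`.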
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
