Taco-VC: A Single Speaker Tacotron based Voice Conversion with Limited Data
Roee Levy Leshem, Raja Giryes

TL;DR
Taco-VC is a voice conversion system that uses a single-speaker Tacotron model and only a few minutes of adaptation data per new speaker to produce high-quality speech, outperforming the baseline in the VCC 2018 SPOKE task while requiring fewer resources than multi-speaker systems.
Contribution
The paper presents Taco-VC, a novel single-speaker Tacotron-based voice conversion architecture that adapts to new speakers with limited data, reducing resource requirements compared to multi-speaker systems.
Findings
Outperforms baseline in VCC 2018 SPOKE task
Achieves competitive results with less data
Uses a speech enhancement network to improve quality
Abstract
This paper introduces Taco-VC, a novel architecture for voice conversion based on the Tacotron synthesizer, which is a sequence-to-sequence model with attention. The training of multi-speaker voice conversion systems requires a large amount of resources, both in training and corpus size. Taco-VC is implemented using a single-speaker Tacotron synthesizer based on Phonetic PosteriorGrams (PPGs) and a single-speaker WaveNet vocoder conditioned on mel spectrograms. To enhance the converted speech quality, and to overcome over-smoothing, the outputs of Tacotron are passed through a novel speech-enhancement network, which is composed of a combination of the phoneme recognition and Tacotron networks. Our system is trained just with a single-speaker corpus and adapts to new speakers using only a few minutes of training data. Using mid-size public datasets, our method outperforms the baseline in the…
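Sketch
The data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every function body is a placeholder, the names and dimensions (phoneme_recognizer, N_PHONEME_CLASSES, N_MELS, hop) are hypothetical, and the stage order (PPG extraction, Tacotron synthesis, speech enhancement, WaveNet vocoding) is inferred from the abstract. The input feature type is also an assumption.

```python
from typing import List

Frames = List[List[float]]  # one feature vector per time frame

# Hypothetical sizes, chosen only to keep the example small.
N_PHONEME_CLASSES = 4
N_MELS = 3


def phoneme_recognizer(source_features: Frames) -> Frames:
    """Placeholder: emit one PPG (a distribution over phoneme classes) per frame."""
    uniform = 1.0 / N_PHONEME_CLASSES
    return [[uniform] * N_PHONEME_CLASSES for _ in source_features]


def tacotron_synthesizer(ppgs: Frames) -> Frames:
    """Placeholder: map PPGs to target-speaker mel frames.

    A real attention-based model may change the sequence length;
    this stub keeps one output frame per input frame.
    """
    return [[sum(ppg)] * N_MELS for ppg in ppgs]


def speech_enhancer(mels: Frames) -> Frames:
    """Placeholder: a real network would counter over-smoothing; this is the identity."""
    return [list(frame) for frame in mels]


def wavenet_vocoder(mels: Frames, hop: int = 2) -> List[float]:
    """Placeholder: produce `hop` silent samples per mel frame."""
    return [0.0 for _ in mels for _ in range(hop)]


def convert(source_features: Frames) -> List[float]:
    """Chain the four stages in the order inferred from the abstract."""
    ppgs = phoneme_recognizer(source_features)
    mels = tacotron_synthesizer(ppgs)
    enhanced = speech_enhancer(mels)
    return wavenet_vocoder(enhanced)


if __name__ == "__main__":
    source = [[0.0] * N_MELS for _ in range(5)]  # five dummy input frames
    waveform = convert(source)
    print(len(waveform))
```

Only the stage boundaries matter here: because the synthesizer consumes PPGs rather than raw source audio, the stages after PPG extraction can be trained on a single target speaker and adapted with little data, as the abstract describes.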
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
Methods: Mixture of Logistic Distributions · Griffin-Lim Algorithm · Sigmoid Activation · Highway Layer · Residual Connection · Convolution · Batch Normalization · Max Pooling · Residual GRU · Bidirectional GRU
