Towards Low-Resource StarGAN Voice Conversion using Weight Adaptive Instance Normalization
Mingjie Chen, Yanpei Shi, Thomas Hain

TL;DR
This paper introduces a novel low-resource StarGAN-based voice conversion model that employs weight adaptive instance normalization to improve data efficiency and performance across many speakers with limited training data.
Contribution
The work proposes a StarGAN-based model that combines a speaker encoder, which extracts speaker embeddings, with adaptive instance normalization applied to convolutional weights, in order to improve many-to-many voice conversion in low-resource scenarios.
Findings
Outperforms baseline methods in objective evaluations.
Achieves higher naturalness and similarity in subjective tests.
Effective with as few as 5 samples per speaker.
Abstract
Many-to-many voice conversion with non-parallel training data has seen significant progress in recent years. StarGAN-based models have attracted interest for voice conversion. However, most StarGAN-based methods have focused only on voice conversion experiments in which the number of speakers was small and the amount of training data was large. In this work, we aim to improve the data efficiency of the model and to achieve many-to-many non-parallel StarGAN-based voice conversion for a relatively large number of speakers with limited training samples. To improve data efficiency, the proposed model uses a speaker encoder to extract speaker embeddings and applies adaptive instance normalization (AdaIN) to convolutional weights. Experiments are conducted with 109 speakers under two low-resource situations, where the number of training samples is 20 and 5 per…
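The idea of applying AdaIN to convolutional weights, rather than to feature maps, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not give the exact formulation, so the per-output-channel normalization, the linear projections from the speaker embedding (`proj_gamma`, `proj_beta`), the tensor shapes, and the function names `weight_adain` and `conv1d` are all assumptions made for illustration.

```python
import numpy as np

def weight_adain(weight, gamma, beta, eps=1e-5):
    """Normalize each output-channel filter, then scale and shift it.

    weight: (out_ch, in_ch, k) convolution kernel
    gamma, beta: (out_ch,) speaker-dependent scale and shift (assumed form)
    """
    mean = weight.mean(axis=(1, 2), keepdims=True)
    std = weight.std(axis=(1, 2), keepdims=True)
    normalized = (weight - mean) / (std + eps)
    return gamma[:, None, None] * normalized + beta[:, None, None]

def conv1d(x, weight):
    """Plain 'valid' 1-D cross-correlation. x: (in_ch, T) -> (out_ch, T - k + 1)."""
    out_ch, _, k = weight.shape
    t_out = x.shape[1] - k + 1
    out = np.zeros((out_ch, t_out))
    for t in range(t_out):
        # Contract the filter with the current window over (in_ch, k).
        out[:, t] = np.tensordot(weight, x[:, t:t + k], axes=([1, 2], [0, 1]))
    return out

rng = np.random.default_rng(0)

# Hypothetical speaker embedding, standing in for a speaker encoder's output.
speaker_embedding = rng.standard_normal(16)

# Hypothetical linear projections mapping the embedding to per-channel
# scale and shift; a trained model would learn these.
proj_gamma = rng.standard_normal((8, 16)) * 0.1
proj_beta = rng.standard_normal((8, 16)) * 0.1
gamma = 1.0 + proj_gamma @ speaker_embedding
beta = proj_beta @ speaker_embedding

weight = rng.standard_normal((8, 4, 3))    # shared convolution kernel
adapted = weight_adain(weight, gamma, beta)  # speaker-specific kernel

features = rng.standard_normal((4, 50))    # e.g. 4 channels, 50 frames
converted = conv1d(features, adapted)
```

In this sketch the convolution kernel is shared across speakers and only the scale and shift depend on the speaker embedding, so each target speaker contributes a small number of conditioning values rather than a full set of weights.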
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
