Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE
Ziang Long, Yunling Zheng, Meng Yu, Jack Xin

TL;DR
This paper enhances VAE-based zero-shot many-to-many voice conversion by integrating self-attention and regularization techniques, leading to improved speaker classification accuracy and voice quality on unseen speakers.
Contribution
It introduces a novel combination of self-attention placement and relaxed group-wise splitting regularization to improve VAE performance in voice conversion.
Findings
28.3% increase in speaker classification accuracy on unseen speakers
Slight improvement in voice quality as per MOSNet scores
Effective incorporation of non-local information in VAE decoder
Abstract
The variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings, and then generating an utterance for a target speaker from that of a source speaker. This is done by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering the desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance and hides the source speaker's identity. We applied the relaxed group-wise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments of zero-shot many-to-many…
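The conversion step described in the abstract (pairing the source speaker's content embedding with the target speaker's identity embedding, then letting a self-attention layer mix information across time frames) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the dimensions, the random weights, the single-head scaled dot-product attention, the residual connection, and the function names `decoder_input` and `self_attention` are all assumptions made for illustration. The actual position of the attention layer inside the decoder is not specified in the text above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def decoder_input(content, speaker):
    """Concatenate a per-frame content embedding with one speaker embedding.

    content: (T, d_c) content embedding of the source utterance.
    speaker: (d_s,) identity embedding of the target speaker.
    Returns an array of shape (T, d_c + d_s).
    """
    tiled = np.broadcast_to(speaker, (content.shape[0], speaker.shape[0]))
    return np.concatenate([content, tiled], axis=-1)

def self_attention(h, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time frames.

    h: (T, d) hidden features. The residual connection is an assumption.
    Returns the attended features (T, d) and the attention map (T, T).
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = softmax(scores, axis=-1)  # each frame attends to all frames
    return h + weights @ v, weights

# Hypothetical sizes: 6 frames, 8-dim content, 4-dim speaker embedding.
rng = np.random.default_rng(0)
T, d_c, d_s = 6, 8, 4
content_src = rng.standard_normal((T, d_c))   # from the source utterance
speaker_tgt = rng.standard_normal(d_s)        # from the target speaker

h = decoder_input(content_src, speaker_tgt)
d = d_c + d_s
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out, attn = self_attention(h, w_q, w_k, w_v)
```

Because every row of the attention map spans all frames, each output frame can draw on the whole utterance, which is the sense in which the layer adds non-local information.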
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
