Enhancing Zero-Shot Many to Many Voice Conversion with Self-Attention VAE

Ziang Long; Yunling Zheng; Meng Yu; Jack Xin

arXiv:2203.16037 · cs.SD · August 23, 2022

Open Access

TL;DR

This paper enhances VAE-based zero-shot many-to-many voice conversion by integrating self-attention and regularization techniques, leading to improved speaker classification accuracy and voice quality on unseen speakers.

Contribution

It introduces a novel combination of self-attention placement and relaxed group-wise splitting regularization to improve VAE performance in voice conversion.

Findings

01. 28.3% increase in speaker classification accuracy on unseen speakers

02. Slight improvement in voice quality as per MOSNet scores

03. Effective incorporation of non-local information in VAE decoder

Abstract

A variational auto-encoder (VAE) is an effective neural network architecture for disentangling a speech utterance into speaker identity and linguistic content latent embeddings, and then generating an utterance for a target speaker from that of a source speaker. This is done by concatenating the identity embedding of the target speaker with the content embedding of the source speaker uttering a desired sentence. In this work, we propose to improve VAE models with self-attention and structural regularization (RGSM). Specifically, we found a suitable location in the VAE's decoder to add a self-attention layer that incorporates non-local information when generating a converted utterance and hides the source speaker's identity. We applied the relaxed group-wise splitting method (RGSM) to regularize network weights and remarkably enhance generalization performance. In experiments of zero-shot many-to-many…
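
The conversion step described in the abstract (target-speaker identity concatenated with source-speaker content, then a self-attention layer in the decoder) can be illustrated with a minimal sketch. This is not the paper's architecture. The function names, toy dimensions, and input values are hypothetical, and the attention uses identity query/key/value projections where a real model would learn them. The abstract does not say where in the decoder the layer sits, so the sketch simply applies attention to the concatenated decoder input.

```python
import math

def concat_embeddings(content_frames, speaker_embedding):
    """Append the target speaker's identity vector to every content frame."""
    return [frame + speaker_embedding for frame in content_frames]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(frames):
    """Scaled dot-product self-attention over time frames.

    Assumption: queries, keys, and values are the frames themselves
    (identity projections); a trained model would use learned projections.
    """
    d = len(frames[0])
    out = []
    for q in frames:
        # Each frame scores every frame, so distant frames can contribute.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in frames]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, frames)) for i in range(d)])
    return out

# Toy inputs (made up): 3 frames of 2-dim content, one 2-dim speaker identity.
source_content = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
target_identity = [1.0, -1.0]

decoder_input = concat_embeddings(source_content, target_identity)
attended = self_attention(decoder_input)
```

Each output frame is a weighted average over all input frames, which is the sense in which self-attention brings in non-local information; a convolutional layer would only see a fixed local window.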

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing