Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

Jennifer Williams; Yi Zhao; Erica Cooper; Junichi Yamagishi

arXiv:2010.10727 · eess.AS · February 11, 2021

TL;DR

This paper introduces a semi-supervised VQ-VAE model with a dedicated speaker encoder and codebook to improve disentanglement of speaker identity and phonetic content in speech synthesis, enhancing generalization and interpretability.

Contribution

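It proposes a VQ-VAE architecture with separate speaker and sub-phone codebooks, and compares two training methods (self-supervised with global conditions and semi-supervised with speaker labels) for speech synthesis, with the learned speaker codes also evaluated on a simple speaker diarization task.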

Findings

01

Adding a speaker VQ codebook improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility).

02

Speaker codebook indices perform slightly better than an x-vector baseline on a simple speaker diarization task.

03

Phones are better recognized from the sub-phone codebooks in the semi-supervised setting.

Abstract

We present a new approach to disentangle speaker voice and phone content by introducing new components to the VQ-VAE architecture for speech synthesis. The original VQ-VAE does not generalize well to unseen speakers or content. To alleviate this problem, we have incorporated a speaker encoder and speaker VQ codebook that learns global speaker characteristics entirely separate from the existing sub-phone codebooks. We also compare two training methods: self-supervised with global conditions and semi-supervised with speaker labels. Adding a speaker VQ component improves objective measures of speech synthesis quality (estimated MOS, speaker similarity, ASR-based intelligibility) and provides learned representations that are meaningful. Our speaker VQ codebook indices can be used in a simple speaker diarization task and perform slightly better than an x-vector baseline. Additionally, phones…
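The two-codebook idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the encoders and decoder are omitted, the vectors below stand in for hypothetical encoder outputs, and mean pooling over time is only an assumption about how a "global" speaker vector might be formed. All names (`nearest_code`, `mean_pool`, `quantize_utterance`) and the toy codebooks are made up for illustration. The sketch shows only the lookup step: each content frame maps to its nearest sub-phone code, while the whole utterance maps to a single speaker code.

```python
def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector`
    under squared Euclidean distance (the usual VQ lookup)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def mean_pool(frames):
    """Average frame vectors over time to get one utterance-level vector.
    Assumption: pooling is a stand-in for whatever the real speaker encoder does."""
    dim = len(frames[0])
    return [sum(frame[d] for frame in frames) / len(frames) for d in range(dim)]


def quantize_utterance(content_frames, speaker_frames, phone_codebook, speaker_codebook):
    """Quantize content per frame and speaker identity once per utterance."""
    phone_indices = [nearest_code(frame, phone_codebook) for frame in content_frames]
    speaker_index = nearest_code(mean_pool(speaker_frames), speaker_codebook)
    return phone_indices, speaker_index


# Toy codebooks (hypothetical values, 2-dimensional for readability).
phone_codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
speaker_codebook = [[0.0, 5.0], [5.0, 0.0]]

# Stand-ins for the outputs of a content encoder and a speaker encoder.
content_frames = [[0.1, -0.1], [0.9, 1.2], [-0.8, 0.9], [0.2, 0.1]]
speaker_frames = [[4.0, 0.5], [6.0, -0.5], [5.0, 0.3]]

phone_indices, speaker_index = quantize_utterance(
    content_frames, speaker_frames, phone_codebook, speaker_codebook
)
print(phone_indices, speaker_index)  # prints: [0, 1, 2, 0] 1
```

Because the speaker index is a single discrete label per utterance or segment, grouping segments by that index is one plausible way such codes could feed a simple diarization setup; the abstract does not spell out the exact procedure used.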

Code & Models

Repositories

rhoposit/icassp2021 (PyTorch, official)

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Natural Language Processing Techniques

Methods: VQ-VAE