Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance
Hieu-Thi Luong, Junichi Yamagishi

TL;DR
This study explores using vector quantization in neural speech synthesis to create discrete latent spaces that maintain quality while reducing data size and enhancing privacy.
Contribution
It introduces a method for modeling latent linguistic embeddings with vector quantization, compares it to continuous representations, and demonstrates its practical benefits.
Findings
Quantized latent spaces achieve speech quality and speaker similarity comparable to continuous ones.
Discrete embeddings reduce bit-rate for data transfer and limit information leakage.
The voice cloning system shows only minor perceptual degradation with quantization.
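As a rough illustration of the bit-rate finding, the sketch below maps continuous latent frames to their nearest codebook entries and compares the bits needed per frame. It is a minimal sketch of generic nearest-neighbor vector quantization, not the authors' implementation. The codebook size (256), latent dimension (64), random data, and the function names `quantize`, `bits_per_frame_discrete`, and `bits_per_frame_continuous` are all hypothetical choices made for the example.

```python
import math

import numpy as np


def quantize(latents, codebook):
    """Map each latent frame to its nearest codebook entry (Euclidean distance).

    latents:  array of shape (frames, dim)
    codebook: array of shape (codes, dim)
    Returns the chosen code indices and the corresponding quantized vectors.
    """
    # Squared distance between every frame and every code: shape (frames, codes)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)
    return indices, codebook[indices]


def bits_per_frame_discrete(codebook_size):
    """Bits needed to transmit one code index."""
    return math.ceil(math.log2(codebook_size))


def bits_per_frame_continuous(dim, bits_per_value=32):
    """Bits needed to transmit one frame of float values."""
    return dim * bits_per_value


# Hypothetical sizes, chosen only for illustration.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))
latents = rng.normal(size=(100, 64))

indices, quantized = quantize(latents, codebook)
discrete_bits = bits_per_frame_discrete(len(codebook))
continuous_bits = bits_per_frame_continuous(latents.shape[1])
```

Under these assumed sizes, only one integer index per frame needs to be transferred instead of 64 float values, which is where the bit-rate reduction comes from. Because the receiver sees only codebook indices, any detail not captured by the codebook is discarded, which is one plausible reading of the reduced information leakage.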
Abstract
Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network, without much attention given to the hidden layers. However, by learning a useful latent representation, the system can be used in many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with its continuous counterpart. By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding that takes on different properties while achieving similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptual evaluations, but has a discrete latent…
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
