QS-TTS: Towards Semi-Supervised Text-to-Speech Synthesis via Vector-Quantized Self-Supervised Speech Representation Learning

Haohan Guo; Fenglong Xie; Jiawen Kang; Yujia Xiao; Xixin Wu; Helen Meng

arXiv:2309.00126·cs.SD·September 4, 2023

TL;DR

QS-TTS introduces a semi-supervised TTS framework leveraging vector-quantized self-supervised speech representations, significantly enhancing speech synthesis quality with less labeled data, especially in low-resource scenarios.

Contribution

The paper presents a novel semi-supervised TTS approach using dual VQ-S3R learners, improving synthesis quality and reducing supervised data needs compared to prior methods.

Findings

01

Achieved the highest mean opinion scores (MOS) in low-resource scenarios.

02

Demonstrated superior audio quality and intelligibility metrics.

03

Synthesis quality degraded more slowly as the amount of supervised data was reduced.

Abstract

This paper proposes QS-TTS, a novel semi-supervised TTS framework that improves TTS quality while requiring less supervised data. It does so through Vector-Quantized Self-Supervised Speech Representation Learning (VQ-S3RL), which makes use of additional unlabeled speech audio. The framework comprises two VQ-S3R learners. First, the principal learner provides a generative Multi-Stage Multi-Codebook (MSMC) VQ-S3R via MSMC-VQ-GAN combined with contrastive S3RL, and decodes it back to high-quality audio. Then, the associate learner further abstracts the MSMC representation into a highly compact VQ representation through a VQ-VAE. These two generative VQ-S3R learners provide useful speech representations and pre-trained models for TTS, significantly improving synthesis quality while lowering the requirement for supervised data. QS-TTS is evaluated comprehensively under various scenarios via…
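
To make the vector-quantization idea in the abstract concrete, here is a minimal sketch of multi-codebook quantization in plain Python. It is not the authors' implementation and does not reflect the code in hhguo/msmc-tts: the function names (`quantize`, `multi_codebook_quantize`), the toy codebooks, and the choice to split each frame into equal chunks with one codebook per chunk are all assumptions made for illustration. Multi-stage processing, codebook learning, and the GAN and contrastive objectives are omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Codebook = Sequence[Vector]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector: Vector, codebook: Codebook) -> Tuple[int, List[float]]:
    """Return the index and value of the codeword nearest to `vector`."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, list(codebook[index])


def multi_codebook_quantize(
    vector: Vector, codebooks: Sequence[Codebook]
) -> Tuple[List[int], List[float]]:
    """Quantize one frame with several codebooks (illustrative assumption).

    The frame is split into len(codebooks) equal chunks, and each chunk is
    replaced by the nearest codeword from its own codebook.
    """
    n = len(codebooks)
    if n == 0 or len(vector) % n != 0:
        raise ValueError("frame length must be divisible by the number of codebooks")
    chunk = len(vector) // n
    indices: List[int] = []
    quantized: List[float] = []
    for h, codebook in enumerate(codebooks):
        part = vector[h * chunk:(h + 1) * chunk]
        idx, code = quantize(part, codebook)
        indices.append(idx)
        quantized.extend(code)
    return indices, quantized


# Toy example: a 4-dimensional frame and two hypothetical 2-entry codebooks.
toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
toy_frame = [0.9, 0.8, 0.1, 0.7]
toy_indices, toy_quantized = multi_codebook_quantize(toy_frame, toy_codebooks)
print(toy_indices, toy_quantized)
```

In this toy setup each frame is represented by one small integer per codebook instead of a continuous vector, which is the sense in which a quantized representation is compact.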

Code & Models

Repositories

hhguo/msmc-tts
pytorch

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing

Methods: VQ-VAE