Exploring Speech Enhancement for Low-resource Speech Synthesis

Zhaoheng Ni; Sravya Popuri; Ning Dong; Kohei Saijo; Xiaohui Zhang; Gael Le Lan; Yangyang Shi; Vikas Chandra; Changhan Wang

arXiv:2309.10795 · eess.AS · September 20, 2023 · 1 citation

TL;DR

This paper investigates how speech enhancement models can improve low-resource speech synthesis by augmenting training data, demonstrating significant quality improvements in Arabic TTS systems and analyzing the impact of speech distortion.

Contribution

It introduces a pipeline applying TF-GridNet speech enhancement to low-resource datasets for TTS training and provides empirical analysis of its effects on performance.

Findings

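01. Training on enhanced speech improves TTS performance on low-resource Arabic datasets.

02. The proposed pipeline improves the TTS system's ASR WER compared with baseline methods.

03. Analysis of how speech distortion introduced by enhancement affects TTS quality.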

Abstract

High-quality and intelligible speech is essential to text-to-speech (TTS) model training; however, obtaining high-quality data for low-resource languages is challenging and expensive. Applying speech enhancement to an Automatic Speech Recognition (ASR) corpus mitigates the issue by augmenting the training data, but how the nonlinear speech distortion introduced by speech enhancement models affects TTS training still needs to be investigated. In this paper, we train a TF-GridNet speech enhancement model and apply it to low-resource datasets that were collected for the ASR task, then train a discrete-unit-based TTS model on the enhanced speech. We use Arabic datasets as an example and show that the proposed pipeline significantly improves the low-resource TTS system compared with other baseline methods in terms of the ASR WER metric. We also run empirical analysis on the correlation between…
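
The ASR WER metric named in the abstract is conventionally the word-level edit distance between a reference transcript and an ASR hypothesis, divided by the number of reference words. The following is a minimal sketch of that standard definition, assuming whitespace tokenization and no text normalization. The function name `word_error_rate` is hypothetical, and this is not the authors' evaluation code; the ASR model and normalization they used are not stated in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length.

    Assumption: words are separated by whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

In a TTS evaluation of this kind, `reference` would be the input text and `hypothesis` the transcript an ASR model produces from the synthesized audio, so a lower value suggests more intelligible synthesis.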

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis