Spoken Language Corpora Augmentation with Domain-Specific Voice-Cloned Speech

Mateusz Czyżnikiewicz; Łukasz Bondaruk; Jakub Kubiak; Adam Wiącek; Łukasz Degórski; Marek Kubis; Paweł Skórzewski

arXiv:2406.07090·eess.AS·February 12, 2025·FedCSIS

TL;DR

This study investigates how augmenting spoken language datasets with synthetic, domain-specific voice-cloned speech affects speech recognition performance, highlighting the benefits of high-variability synthetic data over low-variability data.

Contribution

The paper demonstrates that high-variability synthetic speech generated via voice cloning improves speech recognition accuracy considerably more than low-variability synthetic speech does.

Findings

01. High-variability synthetic speech enhances ASR performance.

02. Low-variability synthetic data quickly saturates in effectiveness.

03. Voice cloning with multiple voices outperforms conventional TTS in augmentation.

Abstract

In this paper, we study the impact of augmenting spoken language corpora with domain-specific synthetic samples for the purpose of training a speech recognition system. Using both a conventional neural TTS system and a zero-shot one with voice cloning ability, we generate speech corpora that vary in the number of voices. We compare speech recognition models trained with the addition of different amounts of synthetic data generated using these two methods with a baseline model trained solely on voice recordings. We show that while the quality of the voice-cloned dataset is lower, its increased multivoiceity makes it much more effective than the one with only a few voices synthesized with the use of a conventional neural TTS system. Furthermore, our experiments indicate that using low-variability synthetic speech quickly leads to saturation in the quality of the ASR whereas high-variability speech…
