A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis

Rishabh Jain; Mariam Yiwere; Dan Bigioi; Peter Corcoran; Horia Cucu

arXiv:2203.11562·cs.SD·April 5, 2022

TL;DR

This paper presents a pipeline for fine-tuning neural TTS models to synthesize child speech, including evaluation methods and initial results demonstrating high naturalness and intelligibility.

Contribution

It introduces a transfer-learning approach for child speech synthesis using a small dataset and develops a comprehensive evaluation framework for synthetic child voices.

Findings

01. Subjective MOS scores indicate high naturalness and intelligibility.

02. Objective evaluations show strong correlation between real and synthetic child speech.

03. Synthetic speech achieves low word error rates comparable to real child speech.
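
The word error rate (WER) in the third finding is conventionally the word-level edit distance between a reference transcript and a recognizer's output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the authors' evaluation code; the function name `word_error_rate` and the lowercase, whitespace-based tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is a simple lowercase whitespace split,
    which is an assumption and not taken from the paper.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Comparing the WER of an ASR system on real recordings with its WER on synthetic utterances of the same text is one way to read "comparable to real child speech"; which recognizer and transcripts were used is not stated in this summary.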

Abstract

Speech synthesis has come a long way: current text-to-speech (TTS) models can now generate natural, human-sounding speech. However, most TTS research uses adult speech data, and very little work has been done on child speech synthesis. This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets. The approach adopts a multi-speaker TTS retuning workflow to provide a transfer-learning pipeline. A publicly available child speech dataset was cleaned to produce a smaller subset of approximately 19 hours, which formed the basis of our fine-tuning experiments. Both subjective and objective evaluations were performed: a pretrained MOSNet was used for objective evaluation, and a novel subjective framework was used for mean opinion score (MOS) evaluations. Subjective evaluations achieved a MOS of 3.95…
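
A mean opinion score is the arithmetic mean of listener ratings, usually on a 1 to 5 scale, and is often reported with a confidence interval. The sketch below shows one common way to compute it, assuming a normal-approximation 95% interval; the function name `mos_with_ci` and the sample ratings are hypothetical and are not data from the paper.

```python
import math
import statistics


def mos_with_ci(ratings, z=1.96):
    """Return (mean, half_width) for a list of listener ratings.

    half_width is the normal-approximation confidence half-width
    z * s / sqrt(n); z = 1.96 corresponds to roughly 95% coverage.
    This is an illustrative convention, not the paper's stated method.
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate spread")
    mean = statistics.fmean(ratings)
    stdev = statistics.stdev(ratings)  # sample standard deviation
    half_width = z * stdev / math.sqrt(len(ratings))
    return mean, half_width


if __name__ == "__main__":
    # Hypothetical ratings from five listeners for one utterance.
    mean, half_width = mos_with_ci([4, 4, 5, 3, 4])
    print(f"MOS = {mean:.2f} +/- {half_width:.2f}")
```

With few listeners, a Student's t interval would be wider and arguably more appropriate than the normal approximation used here.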

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Topic Modeling