Audio-conditioned phonemic and prosodic annotation for building text-to-speech models from unlabeled speech data

Yuma Shirahata; Byeongseon Park; Ryuichi Yamamoto; Kentaro Tachibana

arXiv:2406.08111 · eess.AS · June 13, 2024 · Interspeech

Open Access

TL;DR

This paper introduces an audio-conditioned annotation model, built by fine-tuning a pre-trained ASR model and supplemented with pseudo-labeled data, that turns unlabeled speech into labeled datasets for training TTS models.

Contribution

It presents a novel annotation approach combining fine-tuned ASR and pseudo-labeling to reduce the need for labeled data in TTS dataset creation.

Findings

01

TTS models trained on the dataset produced by the proposed method achieve naturalness comparable to that of models trained on a fully labeled dataset.

02

The annotation method effectively leverages limited labeled data and text-only corpora.

03

Pseudo-labeling with an auxiliary TTS model enhances dataset quality for TTS training.

Abstract

This paper proposes an audio-conditioned phonemic and prosodic annotation model for building text-to-speech (TTS) datasets from unlabeled speech samples. For creating a TTS dataset that consists of label-speech paired data, the proposed annotation model leverages an automatic speech recognition (ASR) model to obtain phonemic and prosodic labels from unlabeled speech samples. By fine-tuning a large-scale pre-trained ASR model, we can construct the annotation model using a limited amount of label-speech paired data within an existing TTS dataset. To alleviate the shortage of label-speech paired data for training the annotation model, we generate pseudo label-speech paired data using text-only corpora and an auxiliary TTS model. This TTS model is also trained with the existing TTS dataset. Experimental results show that the TTS model trained with the dataset created by the proposed…
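The data flow described in the abstract can be sketched as follows. This is only an illustrative outline, not the authors' implementation: the names `Pair`, `build_tts_dataset`, `text_to_labels`, `train_tts`, and `finetune_asr` are hypothetical, the step that converts text into labels is an assumption (the abstract does not say how labels are derived from the text-only corpora), and mixing real and pseudo pairs in a single fine-tuning set is likewise assumed. The toy stand-ins at the bottom exist only so the sketch runs end to end.

```python
from dataclasses import dataclass
from typing import List

Audio = List[float]   # placeholder waveform type (assumption)
Labels = str          # placeholder phonemic + prosodic label string (assumption)


@dataclass
class Pair:
    labels: Labels
    audio: Audio
    pseudo: bool = False  # True when the audio was synthesized, not recorded


def build_tts_dataset(labeled, texts, unlabeled_audio,
                      text_to_labels, train_tts, finetune_asr):
    """Hypothetical outline of the pipeline in the abstract."""
    # 1. Train the auxiliary TTS model on the existing labeled TTS dataset.
    synthesize = train_tts(labeled)
    # 2. Create pseudo label-speech pairs from text-only corpora.
    #    (text_to_labels is an assumed front end, not described in the abstract.)
    pseudo = [Pair(lab, synthesize(lab), pseudo=True)
              for lab in map(text_to_labels, texts)]
    # 3. Fine-tune the pre-trained ASR model into the annotation model,
    #    here on real and pseudo pairs together (an assumption).
    annotate = finetune_asr(list(labeled) + pseudo)
    # 4. Annotate the unlabeled speech to obtain the new TTS dataset.
    return [Pair(annotate(a), a) for a in unlabeled_audio]


# Toy stand-ins so the sketch is runnable; real models would replace these.
def toy_text_to_labels(text):
    return text.lower()


def toy_train_tts(pairs):
    # "Synthesizes" a one-sample waveform whose value is the label length.
    return lambda labels: [float(len(labels))]


def toy_finetune_asr(pairs):
    # "Recognizes" audio by exact lookup in the training pairs.
    table = {tuple(p.audio): p.labels for p in pairs}
    return lambda audio: table.get(tuple(audio), "<unk>")


labeled = [Pair("a", [1.0])]
dataset = build_tts_dataset(labeled, ["HELLO"], [[5.0], [9.0]],
                            toy_text_to_labels, toy_train_tts, toy_finetune_asr)
```

Passing the models in as callables keeps the sketch independent of any particular ASR or TTS toolkit; only the order of the four steps is taken from the abstract.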

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems