A Neural Text-to-Speech Model Utilizing Broadcast Data Mixed with Background Music

Hanbin Bae; Jae-Sung Bae; Young-Sun Joo; Young-Ik Kim; Hoon-Young Cho

arXiv:2103.03049·eess.AS·March 5, 2021

TL;DR

This paper presents a neural TTS approach that trains on broadcast data mixed with background music: a music filter removes the music, and an auxiliary quality classifier helps the model separate filtered speech from clean speech, improving synthesis quality.

Contribution

It introduces a method combining music filtering and a quality classifier within a GST-TTS model to enable high-quality TTS training with limited clean speech data.

Findings

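01 The proposed method synthesizes speech of higher quality than conventional methods.

02 Removing the background music with a music filter makes broadcast data usable for TTS training.

03 The auxiliary quality classifier makes the GST embedding focus on the speech quality (filtered or clean) of the input speech.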

Abstract

Recently, it has become easier to obtain speech data from various media such as the internet or YouTube, but directly utilizing them to train a neural text-to-speech (TTS) model is difficult. The proportion of clean speech is insufficient and the remainder includes background music, which makes training difficult even with the global style token (GST). Therefore, we propose the following method to successfully train an end-to-end TTS model with limited broadcast data. First, the background music is removed from the speech by introducing a music filter. Second, the GST-TTS model with an auxiliary quality classifier is trained with the filtered speech and a small amount of clean speech. In particular, the quality classifier makes the embedding vector of the GST layer focus on representing the speech quality (filtered or clean) of the input speech. The experimental results verified that the proposed method synthesized…
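The role of the auxiliary quality classifier can be illustrated with a minimal sketch. The abstract does not specify the classifier architecture or how its loss is combined with the TTS loss, so everything below is an assumption made for illustration: a single linear layer with a sigmoid on top of the GST embedding, a binary cross-entropy loss with label 1 for clean speech and 0 for filtered speech, and a weighted sum with the reconstruction loss. The names `quality_logit`, `quality_bce`, `total_loss`, and `aux_weight` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def quality_logit(style_embedding, weights, bias):
    """Hypothetical linear classifier head on top of the GST embedding."""
    return sum(w * e for w, e in zip(weights, style_embedding)) + bias


def quality_bce(style_embedding, is_clean, weights, bias):
    """Binary cross-entropy for the assumed clean (1) vs. filtered (0) label."""
    p_clean = sigmoid(quality_logit(style_embedding, weights, bias))
    eps = 1e-12  # guards against log(0)
    target = 1.0 if is_clean else 0.0
    return -(target * math.log(p_clean + eps)
             + (1.0 - target) * math.log(1.0 - p_clean + eps))


def total_loss(reconstruction_loss, classifier_loss, aux_weight=1.0):
    """Assumed combination: TTS reconstruction loss plus weighted auxiliary loss."""
    return reconstruction_loss + aux_weight * classifier_loss


if __name__ == "__main__":
    # Toy 4-dimensional embedding for an utterance labelled as filtered speech.
    embedding = [0.3, -0.1, 0.8, 0.05]
    weights = [0.5, 0.2, -0.4, 0.1]
    aux = quality_bce(embedding, is_clean=False, weights=weights, bias=0.0)
    print(total_loss(reconstruction_loss=1.25, classifier_loss=aux, aux_weight=0.5))
```

In a real system the gradient of the auxiliary loss would flow back into the GST layer, which is what pushes the embedding to encode speech quality; this sketch only computes the loss values and omits training.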

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing