DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling

Chen Zhang; Yi Ren; Xu Tan; Jinglin Liu; Kejun Zhang; Tao Qin; Sheng Zhao; Tie-Yan Liu

arXiv:2012.09547 · eess.AS · December 21, 2020 · 6 cites

TL;DR

DenoiSpeech is a novel TTS system that effectively synthesizes clean speech from noisy data by modeling frame-level noise, outperforming previous denoising approaches in real-world scenarios.

Contribution

The paper introduces a frame-level noise modeling approach integrated with TTS training, enabling high-quality speech synthesis from noisy data.

Findings

01. Outperforms the two previous approaches (training on enhanced speech, and conditioning on a single noise embedding) by 0.31 and 0.66 MOS on real-world data.

02. Successfully models real-world complex noise at the frame level.

03. Demonstrates robustness in noisy speech synthesis scenarios.

Abstract

While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which presents challenges for TTS model training for this speaker. Previous works usually address the challenge using two methods: 1) training the TTS model using speech denoised with an enhancement model; 2) taking a single noise embedding as input when training with noisy speech. However, they usually cannot handle speech with complicated real-world noise, such as noise with high variation over time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition…
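
The difference between a single noise embedding and frame-level noise modeling can be illustrated with a toy sketch. This is not the DenoiSpeech architecture: the additive conditioning, the mean pooling, and the function names `mean_pool`, `condition_utterance_level`, and `condition_frame_level` are all assumptions made for illustration. It only shows why one vector per utterance cannot describe noise that changes from frame to frame.

```python
from typing import List

Vector = List[float]


def mean_pool(frames: List[Vector]) -> Vector:
    """Collapse per-frame noise features into one utterance-level vector."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def condition_utterance_level(hidden: List[Vector], noise: List[Vector]) -> List[Vector]:
    """Hypothetical single-embedding conditioning: every frame receives
    the same pooled noise vector (additive conditioning is an assumption)."""
    pooled = mean_pool(noise)
    return [[h + p for h, p in zip(frame, pooled)] for frame in hidden]


def condition_frame_level(hidden: List[Vector], noise: List[Vector]) -> List[Vector]:
    """Hypothetical frame-level conditioning: each frame receives its own
    noise vector, so time-varying noise stays localized."""
    if len(hidden) != len(noise):
        raise ValueError("hidden states and noise features must have the same number of frames")
    return [[h + n for h, n in zip(h_frame, n_frame)]
            for h_frame, n_frame in zip(hidden, noise)]


# Four frames of 2-dimensional hidden states (all zeros for readability).
hidden_states = [[0.0, 0.0] for _ in range(4)]
# Toy noise features: a short burst that appears only in the third frame.
noise_features = [[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [0.0, 0.0]]

utterance_conditioned = condition_utterance_level(hidden_states, noise_features)
frame_conditioned = condition_frame_level(hidden_states, noise_features)

print(utterance_conditioned)  # [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
print(frame_conditioned)      # [[0.0, 0.0], [0.0, 0.0], [4.0, 4.0], [0.0, 0.0]]
```

In the pooled case the burst is smeared evenly across all four frames, whereas the frame-level variant keeps it in the third frame only. That localization is the property the abstract attributes to fine-grained frame-level noise modeling.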

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing