DenoiSpeech: Denoising Text to Speech with Frame-Level Noise Modeling
Chen Zhang, Yi Ren, Xu Tan, Jinglin Liu, Kejun Zhang, Tao Qin, Sheng Zhao, Tie-Yan Liu

TL;DR
DenoiSpeech is a novel TTS system that effectively synthesizes clean speech from noisy data by modeling frame-level noise, outperforming previous denoising approaches in real-world scenarios.
Contribution
The paper introduces a frame-level noise modeling approach integrated with TTS training, enabling high-quality speech synthesis from noisy data.
Findings
Outperforms the two previous approaches (training on enhanced speech, and conditioning on a single noise embedding) by 0.31 and 0.66 MOS on real-world data.
Successfully models real-world complex noise at the frame level.
Demonstrates robustness in noisy speech synthesis scenarios.
Abstract
While neural-based text to speech (TTS) models can synthesize natural and intelligible voice, they usually require high-quality speech data, which is costly to collect. In many scenarios, only noisy speech of a target speaker is available, which presents challenges for TTS model training for this speaker. Previous works usually address the challenge using two methods: 1) training the TTS model using the speech denoised with an enhancement model; 2) taking a single noise embedding as input when training with noisy speech. However, they usually cannot handle speech with real-world complicated noise such as those with high variations along time. In this paper, we develop DenoiSpeech, a TTS system that can synthesize clean speech for a speaker with noisy speech data. In DenoiSpeech, we handle real-world noisy speech by modeling the fine-grained frame-level noise with a noise condition…
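The contrast the abstract draws between a single noise embedding and frame-level noise modeling can be illustrated with a minimal sketch. This is not the authors' architecture: the function names, the additive form of conditioning, and the toy dimensions are all assumptions made for illustration. The only point is that a single embedding applies the same vector to every frame, while a frame-level condition can differ from frame to frame and so can follow noise that varies over time.

```python
from typing import List

Vector = List[float]


def condition_utterance_level(hidden: List[Vector], noise_embedding: Vector) -> List[Vector]:
    """Hypothetical single-embedding conditioning: one noise vector is added to every frame."""
    return [[h + n for h, n in zip(frame, noise_embedding)] for frame in hidden]


def condition_frame_level(hidden: List[Vector], noise_conditions: List[Vector]) -> List[Vector]:
    """Hypothetical frame-level conditioning: each frame receives its own noise vector."""
    if len(hidden) != len(noise_conditions):
        raise ValueError("need exactly one noise condition per frame")
    return [
        [h + n for h, n in zip(frame, cond)]
        for frame, cond in zip(hidden, noise_conditions)
    ]


# Toy input: 3 frames of 2-dimensional hidden states (assumed sizes).
hidden = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]

# Single embedding: the same offset is applied to all frames.
utterance_conditioned = condition_utterance_level(hidden, [0.5, 0.5])

# Frame-level: only the middle frame is marked as noisy.
frame_conditioned = condition_frame_level(
    hidden, [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
)
```

In this toy example the frame-level variant changes only the middle frame, which a single utterance-wide embedding cannot express.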
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
