NoreSpeech: Knowledge Distillation based Conditional Diffusion Model for Noise-robust Expressive TTS

Dongchao Yang; Songxiang Liu; Jianwei Yu; Helin Wang; Chao Weng; Yuexian Zou

arXiv:2211.02448 · cs.SD · November 7, 2022 · 1 citation

Open Access

TL;DR

NoreSpeech is a noise-robust expressive TTS model that uses diffusion-based style learning, quantized style space, and text-style alignment to synthesize expressive speech from noisy references.

Contribution

It introduces a diffusion-based style learning method with knowledge distillation, a quantized style space, and a length-mismatched style transfer mechanism for noise-robust expressive TTS.

Findings

1. Outperforms previous models in noisy environments
2. Effectively transfers style from noisy references
3. Demonstrates strong generalization to unseen styles

Abstract

Expressive text-to-speech (TTS) can synthesize a new speaking style by imitating prosody and timbre from a reference audio, but it faces the following challenges: (1) The highly dynamic prosody information in the reference audio is difficult to extract, especially when the reference audio contains background noise. (2) TTS systems should generalize well to unseen speaking styles. In this paper, we present a noise-robust expressive TTS model (NoreSpeech), which can robustly transfer the speaking style of a noisy reference utterance to synthesized speech. Specifically, NoreSpeech includes several components: (1) a novel DiffStyle module, which leverages powerful probabilistic denoising diffusion models to learn noise-agnostic speaking style features from a teacher model by knowledge distillation; (2) a VQ-VAE block, which maps the style features…

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing