DRSpeech: Degradation-Robust Text-to-Speech Synthesis with Frame-Level and Utterance-Level Acoustic Representation Learning

Takaaki Saeki; Kentaro Tachibana; Ryuichi Yamamoto

arXiv:2203.15683 · cs.SD · June 30, 2022

TL;DR

DRSpeech is a degradation-robust TTS method that can be trained on speech containing both additive noise and environmental distortions. It jointly learns frame-level and utterance-level acoustic representations, which improves the quality of the synthesized speech.

Contribution

The paper presents a degradation-robust TTS framework with a new regularization technique for disentangling environmental embeddings from linguistic and speaker information.

Findings

01 Significantly improved speech quality in noisy and reverberant conditions.

02 Effective joint modeling of time-variant and time-invariant noises.

03 Outperforms previous noise-robust TTS methods.

Abstract

Most text-to-speech (TTS) methods use high-quality speech corpora recorded in a well-designed environment, incurring a high cost for data collection. To solve this problem, existing noise-robust TTS methods are intended to use noisy speech corpora as training data. However, they only address either time-invariant or time-variant noises. We propose a degradation-robust TTS method, which can be trained on speech corpora that contain both additive noises and environmental distortions. It jointly represents the time-variant additive noises with a frame-level encoder and the time-invariant environmental distortions with an utterance-level encoder. We also propose a regularization method to attain clean environmental embedding that is disentangled from the utterance-dependent information such as linguistic contents and speaker characteristics. Evaluation results show that our method achieved…
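
As a rough illustration of the two granularities named in the abstract, the Python sketch below computes one embedding per frame (able to follow time-variant noise) and one embedding per utterance (a single vector for time-invariant distortion), then combines them into a per-frame conditioning sequence. This is not the authors' architecture: the linear projections, the mean pooling, the concatenation step, the dimensions, and every function name are assumptions made for illustration only.

```python
import random


def linear(vec, weights):
    """Project one feature vector with a weight matrix (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def init_weights(out_dim, in_dim, rng):
    """Small random weights; a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def frame_level_encode(frames, weights):
    """One embedding per frame, so it can change over time."""
    return [linear(frame, weights) for frame in frames]


def utterance_level_encode(frames, weights):
    """Average over time first, giving a single embedding for the utterance."""
    n = len(frames)
    mean = [sum(col) / n for col in zip(*frames)]
    return linear(mean, weights)


def build_conditioning(frames, w_frame, w_utt):
    """Concatenate the shared utterance embedding onto every frame embedding."""
    frame_emb = frame_level_encode(frames, w_frame)
    utt_emb = utterance_level_encode(frames, w_utt)
    return [f + utt_emb for f in frame_emb]


rng = random.Random(0)
n_frames, n_mels, frame_dim, utt_dim = 5, 8, 4, 3  # hypothetical sizes
frames = [[rng.random() for _ in range(n_mels)] for _ in range(n_frames)]
w_frame = init_weights(frame_dim, n_mels, rng)
w_utt = init_weights(utt_dim, n_mels, rng)

cond = build_conditioning(frames, w_frame, w_utt)
print(len(cond), len(cond[0]))  # number of frames, frame_dim + utt_dim
```

In this sketch the last `utt_dim` values are identical across all frames, while the first `frame_dim` values differ from frame to frame, which is the distinction between time-invariant and time-variant representations that the abstract draws.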

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and Dialogue Systems