Rep2wav: Noise Robust text-to-speech Using self-supervised representations

Qiushi Zhu; Yu Gu; Rilin Chen; Chao Weng; Yuchen Hu; Lirong Dai; Jie Zhang

arXiv:2308.14553·eess.AS·September 6, 2023·1 cites

Open Access

TL;DR

This paper introduces Rep2wav, a noise-robust TTS system leveraging self-supervised pre-trained speech representations to improve synthesis quality in noisy conditions, outperforming traditional speech enhancement methods.

Contribution

It proposes a novel TTS framework that maps text to pre-trained representations and then to waveforms, enhancing noise robustness without relying on speech enhancement.
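The two-stage flow described above (text to representation, then representation to waveform) can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the function names, `FRAME_DIM`, `SAMPLES_PER_FRAME`, and the arithmetic inside each stage are hypothetical stand-ins, not the paper's models. It shows only how the stages compose, with the intermediate being a sequence of feature frames rather than noisy or enhanced audio.

```python
from typing import List

FRAME_DIM = 8          # assumed toy size; real self-supervised features are far larger
SAMPLES_PER_FRAME = 4  # assumed toy upsampling factor for the vocoder stage


def text_to_representation(text: str) -> List[List[float]]:
    """Stage 1 stand-in (hypothetical): one pseudo-feature frame per character."""
    frames = []
    for ch in text:
        code = ord(ch)
        # Deterministic values in [0, 1]; a trained acoustic model would go here.
        frames.append([((code * (k + 1)) % 17) / 16.0 for k in range(FRAME_DIM)])
    return frames


def representation_to_waveform(frames: List[List[float]]) -> List[float]:
    """Stage 2 stand-in (hypothetical): expand each frame into audio samples."""
    wav: List[float] = []
    for frame in frames:
        level = sum(frame) / len(frame)  # collapse the frame to one value in [0, 1]
        centered = 2.0 * level - 1.0     # map [0, 1] to [-1, 1]
        # A trained vocoder would generate samples here; we just repeat a constant.
        wav.extend([centered] * SAMPLES_PER_FRAME)
    return wav


def synthesize(text: str) -> List[float]:
    """Compose the two stages: text -> representation frames -> waveform."""
    return representation_to_waveform(text_to_representation(text))
```

Calling `synthesize("hello")` yields one block of `SAMPLES_PER_FRAME` samples per input character. In a real system each stand-in would be replaced by a trained network, and the frame and sample rates would follow the chosen pre-trained model.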

Findings

01 Outperforms speech enhancement-based methods in subjective quality

02 Demonstrates superior noise robustness on LJSpeech and LibriTTS datasets

03 Utilizes self-supervised models for improved noise tolerance

Abstract

Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech and thus suffer from speech distortion and background noise that affect the quality of the synthesized speech. Meanwhile, it was shown that self-supervised pre-trained models exhibit excellent noise robustness on many speech tasks, implying that the learned representation has a better tolerance for noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which aims to learn to map the representation of…

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders

Methods: HiFi-GAN