Rep2wav: Noise Robust Text-to-Speech Using Self-Supervised Representations
Qiushi Zhu, Yu Gu, Rilin Chen, Chao Weng, Yuchen Hu, Lirong Dai, Jie Zhang

TL;DR
This paper introduces Rep2wav, a noise-robust TTS system leveraging self-supervised pre-trained speech representations to improve synthesis quality in noisy conditions, outperforming traditional speech enhancement methods.
Contribution
It proposes a novel TTS framework that maps text to pre-trained representations and then to waveforms, enhancing noise robustness without relying on speech enhancement.
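The two-stage data flow described above (text to representation, then representation to waveform) can be sketched as follows. This is only an illustrative toy under stated assumptions, not the paper's implementation: both stage functions are hypothetical placeholders built from simple arithmetic, and the names `text_to_representation`, `representation_to_waveform`, `FRAME_DIM`, and `HOP` are invented for this example. In the actual system, the intermediate features come from a self-supervised pre-trained speech model, and the second stage is a HiFi-GAN-based vocoder.

```python
import math
from typing import List

FRAME_DIM = 4  # toy feature size per frame (hypothetical; real representations are much larger)
HOP = 8        # toy number of waveform samples generated per frame (hypothetical)


def text_to_representation(text: str) -> List[List[float]]:
    """Stage 1 placeholder: emit one FRAME_DIM-dimensional vector per character.

    A trained text-to-representation model would predict frames that imitate
    the features of a pre-trained speech model; this stub does not.
    """
    frames = []
    for ch in text:
        code = ord(ch)
        frames.append([math.sin(code * (k + 1)) for k in range(FRAME_DIM)])
    return frames


def representation_to_waveform(frames: List[List[float]]) -> List[float]:
    """Stage 2 placeholder: expand every frame into HOP waveform samples.

    A neural vocoder would learn this mapping; here each frame only sets
    the amplitude of one sine cycle.
    """
    wav: List[float] = []
    for frame in frames:
        amp = sum(frame) / len(frame)
        wav.extend(amp * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wav


def synthesize(text: str) -> List[float]:
    """Chain the two stages: text -> representation frames -> waveform."""
    return representation_to_waveform(text_to_representation(text))
```

The point of the sketch is the interface: the only thing passed between the stages is the representation sequence, so no speech enhancement step appears anywhere in the synthesis path.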
Findings
Outperforms speech enhancement-based methods in subjective quality
Demonstrates superior noise robustness on LJSpeech and LibriTTS datasets
Utilizes self-supervised models for improved noise tolerance
Abstract
Benefiting from the development of deep learning, text-to-speech (TTS) techniques using clean speech have achieved significant performance improvements. The data collected from real scenes often contains noise and generally needs to be denoised by speech enhancement models. Noise-robust TTS models are often trained using the enhanced speech, which thus suffer from speech distortion and background noise that affect the quality of the synthesized speech. Meanwhile, it was shown that self-supervised pre-trained models exhibit excellent noise robustness on many speech tasks, implying that the learned representation has a better tolerance for noise perturbations. In this work, we therefore explore pre-trained models to improve the noise robustness of TTS models. Based on HiFi-GAN, we first propose a representation-to-waveform vocoder, which aims to learn to map the representation of…
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Voice and Speech Disorders
Methods: HiFi-GAN
