A Cyclical Post-filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-speech Systems

Yi-Chiao Wu; Patrick Lumban Tobing; Kazuki Yasuhara; Noriyuki Matsunaga; Yamato Ohtani; Tomoki Toda

arXiv:2005.08659 · eess.AS · August 10, 2020

TL;DR

This paper introduces a cyclical post-filtering method that uses voice conversion to refine the input to a neural vocoder in TTS systems. It targets both temporal and acoustic mismatches, with improvements reported in objective metrics and subjective listening tests.

Contribution

The proposed cyclic voice conversion framework generates temporally matched pseudo-VC data for training neural vocoders and acoustically matched enhanced data for testing them. It can be applied to arbitrary TTS systems and neural vocoders.

Findings

01. Improved speech quality confirmed by objective metrics.

02. Enhanced naturalness in subjective listening tests.

03. Framework applicable to different TTS systems and neural vocoders.

Abstract

Recently, the effectiveness of text-to-speech (TTS) systems combined with neural vocoders to generate high-fidelity speech has been shown. However, collecting the required training data and building these advanced systems from scratch are time- and resource-consuming. An economical approach is to develop a neural vocoder to enhance the speech generated by existing or low-cost TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders. Because of its generality, this framework can be applied to arbitrary TTS systems and neural vocoders. In this paper, we apply…
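The data flow described in the abstract can be illustrated with a small sketch. This is only a toy illustration under stated assumptions, not the authors' implementation: the names `make_cyclic_training_pairs`, `enhance_for_testing`, `to_tts`, and `to_natural` are hypothetical, the converters are frame-wise stand-ins for a trained VC model, and the natural → TTS-like → natural-like cycle is one assumed reading of "cyclic". Because a frame-wise cycle preserves sequence length, the pseudo-VC features stay temporally aligned with the natural target, which is the property the training data needs.

```python
from typing import Callable, List, Tuple

# Toy stand-in for a sequence of acoustic feature frames (assumption).
Features = List[float]
Converter = Callable[[Features], Features]


def make_cyclic_training_pairs(
    natural_feats: List[Features],
    to_tts: Converter,
    to_natural: Converter,
) -> List[Tuple[Features, Features]]:
    """Build (pseudo-VC input, natural target) pairs for vocoder training.

    Each natural utterance is converted to the TTS domain and back
    (assumed cycle). The result has the same number of frames as the
    natural utterance, so input and target are temporally matched.
    """
    pairs = []
    for feats in natural_feats:
        pseudo_vc = to_natural(to_tts(feats))
        if len(pseudo_vc) != len(feats):
            raise ValueError("cycle must preserve the number of frames")
        pairs.append((pseudo_vc, feats))
    return pairs


def enhance_for_testing(
    tts_feats: List[Features], to_natural: Converter
) -> List[Features]:
    """Convert TTS features toward the natural domain before vocoding,
    so test inputs are acoustically closer to the training inputs."""
    return [to_natural(feats) for feats in tts_feats]


# Hypothetical frame-wise converters standing in for a trained VC model.
def toy_to_tts(feats: Features) -> Features:
    return [x * 0.5 for x in feats]


def toy_to_natural(feats: Features) -> Features:
    return [x * 2.0 for x in feats]


train_pairs = make_cyclic_training_pairs(
    [[1.0, 2.0, 3.0], [4.0, 8.0]], toy_to_tts, toy_to_natural
)
test_inputs = enhance_for_testing([[0.5, 1.0]], toy_to_natural)
```

In this sketch the vocoder would be trained on `train_pairs` and fed `test_inputs` at synthesis time; both pass through the same `to_natural` conversion, which is how the training and testing conditions are brought closer together.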

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing