A Cyclical Approach to Synthetic and Natural Speech Mismatch Refinement of Neural Post-filter for Low-cost Text-to-speech System

Yi-Chiao Wu; Patrick Lumban Tobing; Kazuki Yasuhara; Noriyuki Matsunaga; Yamato Ohtani; Tomoki Toda

arXiv:2207.05913·eess.AS·September 28, 2022

Open Access

TL;DR

This paper introduces a cyclical framework for developing a neural post-filter (NPF) for low-cost TTS systems using neural vocoders. The framework targets the acoustic and temporal mismatches between synthetic and natural speech, and both objective and subjective evaluations are reported to demonstrate its effectiveness.

Contribution

The paper proposes a cyclical approach to developing neural post-filters for low-cost TTS systems, using neural vocoders to tackle the acoustic and temporal mismatches between synthetic and natural speech.

Findings

01. Effective reduction of acoustic mismatch.

02. Improved speech naturalness in subjective tests.

03. Framework applicable to low-resource TTS systems.

Abstract

Neural-based text-to-speech (TTS) systems achieve very high-fidelity speech generation owing to rapid developments in neural networks. However, their requirements for a huge labeled corpus and high computation cost make it difficult for small companies or individuals to develop a high-fidelity TTS system. On the other hand, a neural vocoder, which has been widely adopted for speech generation in neural-based TTS systems, can be trained with a relatively small unlabeled corpus. Therefore, in this paper, we explore a general framework for developing a neural post-filter (NPF) for low-cost TTS systems using neural vocoders. A cyclical approach is proposed to tackle the acoustic and temporal mismatches (AM and TM) that arise when developing an NPF. Both objective and subjective evaluations have been conducted to demonstrate the AM and TM problems and the effectiveness of the proposed framework.

Taxonomy

Topics: Speech Recognition and Synthesis

Methods: Attention Model