Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion

Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari

arXiv:2210.09815·cs.SD·September 20, 2023

TL;DR

This paper introduces a training method for spontaneous speech synthesis that improves the robustness and naturalness of speech synthesized with filled pauses (FPs) by combining linguistic speech regularization with pseudo-FP insertion.

Contribution

It proposes a novel regularization approach combined with pseudo-FP sampling to improve the synthesis of spontaneous speech with disfluencies.

Findings

01. Improved naturalness scores for synthetic speech with ground-truth FPs by 0.24.

02. Enhanced robustness to predicted FPs with a 0.26 increase in naturalness.

03. Effective stabilization of linguistic element synthesis in spontaneous speech.

Abstract

We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, it utilizes pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the…
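
The pseudo-FP idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes an FP word prediction model has already produced, for each word, a probability distribution over FP labels to insert before that word. The function name `insert_pseudo_fps`, the `NO_FP` label, and the example FP words are all hypothetical.

```python
import random

# Hypothetical label meaning "insert no filled pause at this position".
NO_FP = "<none>"


def insert_pseudo_fps(words, fp_probs, rng):
    """Sample one filled pause (FP) per position and insert it before the word.

    words:    list of word tokens (the linguistic, non-FP content).
    fp_probs: one dict per word, mapping an FP label (or NO_FP) to the
              predicted probability of that label preceding the word.
              These distributions are assumed to come from some FP word
              prediction model, which is outside this sketch.
    rng:      random.Random instance, so the sampling is reproducible.
    """
    if len(words) != len(fp_probs):
        raise ValueError("need one FP distribution per word")
    out = []
    for word, dist in zip(words, fp_probs):
        labels = list(dist.keys())
        weights = list(dist.values())
        # Draw a single label according to the predicted probabilities.
        fp = rng.choices(labels, weights=weights, k=1)[0]
        if fp != NO_FP:
            out.append(fp)
        out.append(word)
    return out


if __name__ == "__main__":
    words = ["I", "think", "so"]
    # Made-up probabilities, for illustration only.
    fp_probs = [
        {"um": 0.3, "uh": 0.2, NO_FP: 0.5},
        {"uh": 0.1, NO_FP: 0.9},
        {"um": 0.1, NO_FP: 0.9},
    ]
    print(insert_pseudo_fps(words, fp_probs, random.Random(0)))
```

Because each call draws a fresh sample, repeated calls on the same sentence yield different FP placements, which is how sampled pseudo-FPs could expose a synthesis model to more varied insertions than the ground-truth FPs alone.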

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques