Improving robustness of spontaneous speech synthesis with linguistic speech regularization and pseudo-filled-pause insertion
Yuta Matsunaga, Takaaki Saeki, Shinnosuke Takamichi, and Hiroshi Saruwatari

TL;DR
This paper introduces a training method for spontaneous speech synthesis with filled pause (FP) insertion. It improves robustness and naturalness by regularizing the synthesis of the linguistic (non-FP) speech and by training with pseudo-FPs sampled from an FP word prediction model in addition to ground-truth FPs.
Contribution
It proposes a novel regularization approach combined with pseudo-FP sampling to improve the synthesis of spontaneous speech with disfluencies.
Findings
Improved naturalness scores for synthetic speech with ground-truth FPs by 0.24.
Enhanced robustness to predicted FPs with a 0.26 increase in naturalness.
Effective stabilization of linguistic element synthesis in spontaneous speech.
Abstract
We present a training method with linguistic speech regularization that improves the robustness of spontaneous speech synthesis methods with filled pause (FP) insertion. Spontaneous speech synthesis is aimed at producing speech with human-like disfluencies, such as FPs. Because modeling the complex data distribution of spontaneous speech with a rich FP vocabulary is challenging, the quality of FP-inserted synthetic speech is often limited. To address this issue, we present a method for synthesizing spontaneous speech that improves robustness to diverse FP insertions. Regularization is used to stabilize the synthesis of the linguistic speech (i.e., non-FP) elements. To further improve robustness to diverse FP insertions, the method uses pseudo-FPs sampled using an FP word prediction model as well as ground-truth FPs. Our experiments demonstrated that the proposed method improves the…
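The pseudo-FP insertion step described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `insert_pseudo_fps`, the representation of the FP word prediction model's output as one probability table per word position (with `None` meaning "no FP"), the choice to insert the sampled FP before the word, and the English FP words in the example. The regularization term is not sketched because the abstract does not specify its form.

```python
import random


def insert_pseudo_fps(words, fp_distributions, rng):
    """Sample one pseudo-FP (or none) per word position and insert it.

    words:            list of linguistic (non-FP) tokens.
    fp_distributions: hypothetical stand-in for an FP word prediction
                      model's output; one dict per word mapping an FP
                      word (or None for "no FP") to its probability.
    rng:              a random.Random instance, passed in for reproducibility.
    """
    if len(words) != len(fp_distributions):
        raise ValueError("need exactly one FP distribution per word")

    augmented = []
    for word, dist in zip(words, fp_distributions):
        candidates = list(dist.keys())
        weights = list(dist.values())
        # Draw a single candidate according to the predicted probabilities.
        sampled = rng.choices(candidates, weights=weights, k=1)[0]
        if sampled is not None:
            # Assumption: the FP is placed immediately before the word.
            augmented.append(sampled)
        augmented.append(word)
    return augmented


# Illustrative input: a sentence and made-up per-position FP probabilities.
sentence = ["I", "think", "so"]
predicted = [
    {None: 0.7, "uh": 0.2, "um": 0.1},
    {None: 0.5, "uh": 0.3, "um": 0.2},
    {None: 0.9, "uh": 0.1},
]
pseudo_fp_sentence = insert_pseudo_fps(sentence, predicted, random.Random(0))
```

Under these assumptions, calling the function repeatedly with different seeds yields differently FP-inserted variants of the same sentence, which is the kind of diversity the abstract says is mixed with ground-truth FPs during training.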
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
