Empirical Study Incorporating Linguistic Knowledge on Filled Pauses for Personalized Spontaneous Speech Synthesis

Yuta Matsunaga; Takaaki Saeki; Shinnosuke Takamichi; Hiroshi Saruwatari

arXiv:2210.07559·cs.SD·September 20, 2023

TL;DR

This paper empirically investigates personalized spontaneous speech synthesis by incorporating linguistic knowledge of filled pauses, emphasizing the importance of precise position and word prediction for naturalness and individuality.

Contribution

It introduces a new approach to personalized speech synthesis that models filled pauses, combining linguistic insights with empirical evaluation to improve naturalness and individualization.

Findings

1. Precise position prediction enhances speech naturalness.

2. Word prediction is crucial for capturing individual speech traits.

3. Personalized filled pause modeling improves speech synthesis quality.
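
The findings above treat filled pause prediction as two decisions: where a filled pause goes (position) and which filled pause is used (word). The toy sketch below only illustrates that split on a tokenized sentence. It is not the authors' model or code: the function name `insert_filled_pauses`, the index-to-word dictionary format, and the English filled pauses "um" and "uh" are all assumptions made for illustration.

```python
from typing import Dict, List


def insert_filled_pauses(tokens: List[str], predicted: Dict[int, str]) -> List[str]:
    """Insert filled pauses into a token sequence.

    `predicted` is a hypothetical predictor output: each key is a token
    index (the position decision) and each value is the filled pause to
    place before that token (the word decision). The key len(tokens)
    means "after the last token".
    """
    out: List[str] = []
    for i, token in enumerate(tokens):
        if i in predicted:
            out.append(predicted[i])  # filled pause goes before token i
        out.append(token)
    if len(tokens) in predicted:
        out.append(predicted[len(tokens)])  # sentence-final filled pause
    return out


# Illustrative input only; a real system would obtain `predicted` from a
# trained (personalized or non-personalized) predictor.
tokens = ["I", "think", "this", "works"]
predicted = {0: "um", 2: "uh"}
result = insert_filled_pauses(tokens, predicted)
print(" ".join(result))
```

In this framing, a personalized predictor and a non-personalized one would differ only in the `predicted` mapping they produce for the same text, which is what makes the two easy to compare in a synthesis pipeline.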

Abstract

We present a comprehensive empirical study for personalized spontaneous speech synthesis on the basis of linguistic knowledge. With the advent of voice cloning for reading-style speech synthesis, a new voice cloning paradigm for human-like and spontaneous speech synthesis is required. We, therefore, focus on personalized spontaneous speech synthesis that can clone both the individual's voice timbre and speech disfluency. Specifically, we deal with filled pauses, a major source of speech disfluency, which is known to play an important role in speech generation and communication in psychology and linguistics. To comparatively evaluate personalized filled pause insertion and non-personalized filled pause prediction methods, we developed a speech synthesis method with a non-personalized external filled pause predictor trained with a multi-speaker corpus. The results clarify the…

Code & Models

Repositories

ndkgit339/fastspeech2-filled_pause_speech_synthesis
PyTorch · Official

Taxonomy

Topics: Speech and dialogue systems · Topic Modeling · Speech Recognition and Synthesis