The Impact of Prosodic Segmentation on Speech Synthesis of Spontaneous Speech

Julio Cesar Galdino; Sidney Evaldo Leal; Leticia Gabriella De Souza; Rodrigo de Freitas Lima; Antonio Nelson Fornari Mendes Moreira; Arnaldo Candido Junior; Miguel Oliveira Jr.; Edresson Casanova; Sandra M. Aluísio

arXiv:2511.14779 · cs.CL · November 20, 2025

Open Access

TL;DR

This study examines how explicit prosodic segmentation annotations, both manual and automatic, influence the naturalness and intelligibility of speech synthesized from spontaneous Brazilian Portuguese, highlighting the benefits of manual segmentation.

Contribution

The paper investigates how explicit prosodic segmentation affects the quality of synthesized spontaneous speech, comparing manual and automatic annotations as training data for a non-autoregressive model, FastSpeech 2.

Findings

01. Training with prosodic segmentation improves speech intelligibility and naturalness.

02. Manual segmentation introduces more variability, enhancing prosody.

03. Both approaches reproduce expected nuclear accent patterns, with manual aligning more closely to natural contours.

Abstract

Spontaneous speech presents several challenges for speech synthesis, particularly in capturing the natural flow of conversation, including turn-taking, pauses, and disfluencies. Although speech synthesis systems have made significant progress in generating natural and intelligible speech, primarily through architectures that implicitly model prosodic features such as pitch, intensity, and duration, the construction of datasets with explicit prosodic segmentation and their impact on spontaneous speech synthesis remains largely unexplored. This paper evaluates the effects of manual and automatic prosodic segmentation annotations in Brazilian Portuguese on the quality of speech synthesized by a non-autoregressive model, FastSpeech 2. Experimental results show that training with prosodic segmentation produced slightly more intelligible and acoustically natural speech. While automatic…

Taxonomy

Topics: Phonetics and Phonology Research · Speech Recognition and Synthesis · Voice and Speech Disorders