Prosodic Prominence and Boundaries in Sequence-to-Sequence Speech Synthesis
Antti Suni, Sofoklis Kakouros, Martti Vainio, Juraj Šimko

TL;DR
This paper investigates whether augmenting text input with automatically extracted prosodic labels improves the naturalness and accuracy of prosody in sequence-to-sequence speech synthesis, addressing limitations in reproducing local prosodic features.
Contribution
The study introduces a wavelet-based method to extract prosodic labels and demonstrates their effectiveness in enhancing prosodic accuracy in speech synthesis.
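As a rough illustration of what a wavelet-based prominence label could look like, the sketch below convolves a prosodic contour with a Mexican-hat wavelet at a single word-sized scale, takes the peak response inside each word as its prominence, and quantizes the result into a few discrete labels that are attached to the text. This is not the authors' implementation: the single scale, the equal-width quantization, the `<pN>` tag format, and all function names are assumptions made for the example.

```python
import numpy as np

def ricker(points, a):
    # Mexican-hat (Ricker) wavelet with width `a`, sampled at `points` positions.
    t = np.arange(points) - (points - 1) / 2.0
    norm = 2.0 / (np.sqrt(3.0 * a) * np.pi ** 0.25)
    return norm * (1.0 - (t / a) ** 2) * np.exp(-0.5 * (t / a) ** 2)

def cwt_scale(signal, width):
    # One scale of a continuous wavelet transform, computed by direct convolution.
    # The kernel is capped at the signal length so the output keeps that length.
    n = min(10 * int(width), len(signal))
    return np.convolve(signal, ricker(n, width), mode="same")

def word_prominence(signal, word_spans, width):
    # Assumption: prominence of a word = peak wavelet response within its frames.
    response = cwt_scale(signal, width)
    return [float(response[start:end].max()) for start, end in word_spans]

def quantize(values, n_levels=3):
    # Map continuous values to integer labels 0..n_levels-1 using equal-width bins.
    v = np.asarray(values, dtype=float)
    lo, hi = v.min(), v.max()
    if hi == lo:
        return [0] * len(v)
    edges = np.linspace(lo, hi, n_levels + 1)[1:-1]
    return np.digitize(v, edges).tolist()

def augment(words, labels):
    # Hypothetical tag format: append a prominence tag to every word.
    return " ".join(f"{w}<p{l}>" for w, l in zip(words, labels))

if __name__ == "__main__":
    # Synthetic contour: flat, with one bump in the middle of the second word.
    frames = np.arange(300)
    contour = np.exp(-0.5 * ((frames - 150) / 10.0) ** 2)
    spans = [(0, 100), (100, 200), (200, 300)]
    labels = quantize(word_prominence(contour, spans, width=20))
    print(augment(["the", "cat", "sat"], labels))
```

In a real pipeline the contour would come from a pitch tracker and the word spans from a forced alignment; both are replaced here by synthetic values so the example stays self-contained.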
Findings
Prosodic labels improve f0 contour accuracy.
Energy contour fidelity is enhanced with labels.
Objective metrics show significant improvement.
Abstract
Recent advances in deep learning methods have elevated synthetic speech quality to human level, and the field is now moving towards addressing prosodic variation in synthetic speech. Despite successes in this effort, the state-of-the-art systems fall short of faithfully reproducing local prosodic events that give rise to, e.g., word-level emphasis and phrasal structure. This type of prosodic variation often reflects long-distance semantic relationships that are not accessible for end-to-end systems with a single sentence as their synthesis domain. One of the possible solutions might be conditioning the synthesized speech by explicit prosodic labels, potentially generated using longer portions of text. In this work we evaluate whether augmenting the textual input with such prosodic labels capturing word-level prominence and phrasal boundary strength can result in more accurate realization…
