ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller

TL;DR
ProsodyFM is an unsupervised TTS model that significantly improves speech intelligibility by enhancing phrasing and intonation control through novel encoders and a flow-matching backbone, without requiring explicit prosodic labels.
Contribution
It introduces a prosody-aware TTS model that pairs a Phrase Break Encoder and a Terminal Intonation Encoder with a flow-matching backbone and learns prosody patterns without explicit prosodic labels, improving naturalness and generalization.
Findings
Enhanced phrasing and intonation in synthesized speech.
Outperforms four state-of-the-art models in intelligibility.
Demonstrates superior generalization to unseen sentences and speakers.
Abstract
Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no…
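The abstract names a flow-matching backbone but does not spell out its training objective. As a rough illustration only, the sketch below shows the generic conditional flow-matching loss with a straight-line path from noise to data. It is not ProsodyFM's implementation: the names `ot_path_sample`, `cfm_loss`, `zero_field`, and the value of `SIGMA_MIN` are assumptions made for this example, and the prosody conditioning from the two encoders is omitted.

```python
import random

SIGMA_MIN = 1e-4  # assumed small constant; not taken from the paper


def ot_path_sample(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) on the straight-line path from noise x0 to data x1.

    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1
    u_t = d(x_t)/dt = x1 - (1 - sigma_min) * x0
    """
    scale = 1.0 - (1.0 - sigma_min) * t
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(vector_field, x1, rng, sigma_min=SIGMA_MIN):
    """Single-sample loss: mean squared error between the predicted
    velocity and the target velocity at a random time t in [0, 1)."""
    t = rng.random()
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]  # Gaussian noise sample
    x_t, u_t = ot_path_sample(x0, x1, t, sigma_min)
    v = vector_field(x_t, t)
    return sum((p - q) ** 2 for p, q in zip(v, u_t)) / len(x1)


def zero_field(x_t, t):
    # Hypothetical stand-in for a neural network; always predicts zero velocity.
    return [0.0 for _ in x_t]


rng = random.Random(0)
mel_frame = [0.5, -1.0, 2.0]  # toy stand-in for one mel-spectrogram frame
loss = cfm_loss(zero_field, mel_frame, rng)
```

In a real system the vector field would be a trained network that also receives text and prosody features as conditioning, and the loss would be averaged over batches of frames.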
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Phonetics and Phonology Research
Methods: Sparse Evolutionary Training
