TL;DR
This paper presents an end-to-end pipeline that inserts SSML tags into French text to control pitch, speaking rate, volume, and pause duration in commercial TTS, raising the mean opinion score (MOS) of the synthesized speech from 3.20 to 3.87.
Contribution
It introduces a cascaded architecture of two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other regresses prosodic targets, which are emitted as SSML markup compatible with commercial TTS systems.
Findings
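Achieved 99.2% F1 for break placement on a 14-hour French podcast corpus
Reduced mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only LLMs and a BiLSTM baseline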
Significant perceptual quality improvement (MOS from 3.20 to 3.87)
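For readers unfamiliar with the two automatic metrics above, here is a minimal sketch of how break-placement F1 and mean absolute error are conventionally computed. The function names and the representation of breaks as token indices are assumptions made for illustration; the paper's exact evaluation protocol is not given in this summary.

```python
def break_f1(predicted, reference):
    """F1 between predicted and reference break positions.

    Positions are token indices after which a break occurs
    (an assumed representation, not taken from the paper).
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


def mean_absolute_error(predicted, reference):
    """Average absolute difference between paired prosodic values."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) for p, r in zip(predicted, reference))
    return total / len(predicted)
```

A 25-40% reduction in mean absolute error means the second function returns a value that much smaller for the fine-tuned model than for a baseline on the same reference targets.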
Abstract
Despite recent advances, synthetic voices often lack expressiveness due to limited prosody control in commercial text-to-speech (TTS) systems. We introduce the first end-to-end pipeline that inserts Speech Synthesis Markup Language (SSML) tags into French text to control pitch, speaking rate, volume, and pause duration. We employ a cascaded architecture with two QLoRA-fine-tuned Qwen 2.5-7B models: one predicts phrase-break positions and the other performs regression on prosodic targets, generating commercial TTS-compatible SSML markup. Evaluated on a 14-hour French podcast corpus, our method achieves 99.2% F1 for break placement and reduces mean absolute error on pitch, rate, and volume by 25-40% compared with prompting-only large language models (LLMs) and a BiLSTM baseline. In perceptual evaluation involving 18 participants across over 9 hours of synthesized audio, SSML-enhanced…
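To make the final step of the cascade concrete, the sketch below shows one way predicted breaks and prosodic targets could be serialized into SSML. It is an illustration under stated assumptions, not the authors' implementation: the `Phrase` class, its field names, and the choice of units (percent for pitch and rate, dB for volume, milliseconds for pauses) are hypothetical, and only the standard SSML `speak`, `prosody`, and `break` elements are used.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Phrase:
    # All fields are hypothetical stand-ins for the two models' outputs.
    text: str          # words between two predicted phrase breaks
    pitch_pct: float   # relative pitch change, in percent
    rate_pct: float    # speaking rate, as a percentage of the default
    volume_db: float   # volume change, in decibels
    break_ms: int      # pause after the phrase, in milliseconds (0 = none)


def to_ssml(phrases):
    """Wrap each phrase in <prosody> and add a <break> after it if needed."""
    parts = ["<speak>"]
    for p in phrases:
        parts.append(
            f'<prosody pitch="{p.pitch_pct:+.0f}%" rate="{p.rate_pct:.0f}%" '
            f'volume="{p.volume_db:+.1f}dB">{escape(p.text)}</prosody>'
        )
        if p.break_ms > 0:
            parts.append(f'<break time="{p.break_ms}ms"/>')
    parts.append("</speak>")
    return "".join(parts)


example = to_ssml([
    Phrase("Bonjour à tous", 5.0, 95.0, 1.5, 300),
    Phrase("bienvenue & merci", -3.0, 100.0, 0.0, 0),
])
```

The text is XML-escaped so that characters such as `&` do not break the markup. Commercial TTS engines differ in which attribute values they accept, so the value formats here would need checking against the target engine.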
