Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP
Jinzuomu Zhong, Yang Li, Hui Huang, Korin Richmond, Jie Liu, Zhiba Su, Jing Guo, Benlai Tang, Fengjie Zhu

TL;DR
This paper introduces a novel two-stage automatic prosody annotation method using contrastive pretraining and multi-modal fusion, significantly improving prosodic boundary detection in TTS systems with high accuracy and robustness.
Contribution
The paper presents a new automatic prosody annotation pipeline combining contrastive pretraining and multi-modal fusion, advancing the state-of-the-art in prosodic boundary detection.
Findings
Achieves 0.72 F1 for Prosodic Word boundary detection.
Achieves 0.93 F1 for Prosodic Phrase boundary detection.
Demonstrates robustness in limited-data scenarios.
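For readers unfamiliar with the metric, the F1 figures above can be read as the harmonic mean of precision and recall for one boundary type. The sketch below is a minimal illustration, assuming one boundary label per word and an exact-match comparison; the paper's actual evaluation protocol is not described in this summary, and the function name, label names ("PW", "PPH", "O"), and toy sequences are hypothetical.

```python
def boundary_f1(reference, predicted, label):
    """F1 for one boundary label over aligned per-word label sequences."""
    if len(reference) != len(predicted):
        raise ValueError("sequences must be aligned and equal in length")
    pairs = list(zip(reference, predicted))
    # True positives: boundary present in both reference and prediction.
    tp = sum(1 for r, p in pairs if r == label and p == label)
    # False positives: predicted boundary that the reference lacks.
    fp = sum(1 for r, p in pairs if r != label and p == label)
    # False negatives: reference boundary that was not predicted.
    fn = sum(1 for r, p in pairs if r == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data: "PW" = Prosodic Word boundary,
# "PPH" = Prosodic Phrase boundary, "O" = no boundary after the word.
reference = ["O", "PW", "O", "PPH", "PW", "O", "PPH"]
predicted = ["O", "PW", "PW", "PPH", "O", "O", "PPH"]

pw_f1 = boundary_f1(reference, predicted, "PW")
pph_f1 = boundary_f1(reference, predicted, "PPH")
```

Scoring each boundary type separately, as here, matches how the summary reports one F1 value per boundary level.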
Abstract
In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator, comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while bearing remarkable robustness to data…
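The first-stage idea of contrastively pairing speech-side and text-side representations can be illustrated with a generic symmetric contrastive (InfoNCE-style) loss. This is only a sketch under stated assumptions: the abstract does not specify the loss, the encoders, or the temperature, so the CLIP-style formulation, the function names, the default temperature of 0.07, and the toy embeddings below are all hypothetical stand-ins rather than the authors' method.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: the i-th speech and i-th text embedding
    form the positive pair; every other pairing in the batch is a negative."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarity matrix scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: each row should pick out its own column.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: the same, over the transposed matrix.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy embeddings (hypothetical): two Speech-Silence segments and the
# two Word-Punctuation segments they are paired with.
speech = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[1.0, 0.0], [0.0, 1.0]]
text_shuffled = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = symmetric_contrastive_loss(speech, text_aligned)
shuffled_loss = symmetric_contrastive_loss(speech, text_shuffled)
```

With these toy inputs the correctly paired batch yields a much smaller loss than the shuffled one, which is the behaviour a contrastive objective relies on to pull matching speech and text representations together.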
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques
