Multi-Modal Automatic Prosody Annotation with Contrastive Pretraining of SSWP

Jinzuomu Zhong; Yang Li; Hui Huang; Korin Richmond; Jie Liu; Zhiba Su; Jing Guo; Benlai Tang; Fengjie Zhu

arXiv:2309.05423·eess.AS·June 12, 2024

Open Access · 1 Repo

TL;DR

This paper introduces a two-stage automatic prosody annotation method that combines contrastive pretraining with multi-modal fusion, reaching F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundary detection while remaining robust when training data is limited.

Contribution

The paper presents a new automatic prosody annotation pipeline combining contrastive pretraining and multi-modal fusion, advancing the state-of-the-art in prosodic boundary detection.

Findings

1. Achieves 0.72 F1 for Prosodic Word boundary detection.

2. Achieves 0.93 F1 for Prosodic Phrase boundary detection.

3. Demonstrates robustness to limited data scenarios.
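
For readers unfamiliar with the metric, the F1 figures above are the harmonic mean of precision and recall for one boundary class. The sketch below shows one way to compute a per-class F1 over a sequence of boundary labels. The label names ("PW", "PPH", "NB") and the toy sequences are hypothetical and chosen only for illustration; this is not the paper's evaluation script.

```python
def boundary_f1(reference, predicted, label):
    """Per-class F1 for one boundary label over aligned label sequences."""
    pairs = list(zip(reference, predicted))
    tp = sum(1 for r, p in pairs if r == label and p == label)
    fp = sum(1 for r, p in pairs if r != label and p == label)
    fn = sum(1 for r, p in pairs if r == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: PW = Prosodic Word, PPH = Prosodic Phrase, NB = no boundary.
reference = ["PW", "NB", "PPH", "PW", "NB"]
predicted = ["PW", "PW", "PPH", "NB", "NB"]

print(boundary_f1(reference, predicted, "PW"))
print(boundary_f1(reference, predicted, "PPH"))
```

Scoring each boundary class separately, as here, is what allows the two classes to be reported with different F1 values.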

Abstract

In expressive and controllable Text-to-Speech (TTS), explicit prosodic features significantly improve the naturalness and controllability of synthesised speech. However, manual prosody annotation is labor-intensive and inconsistent. To address this issue, this paper proposes a novel two-stage automatic annotation pipeline. In the first stage, we use contrastive pretraining of Speech-Silence and Word-Punctuation (SSWP) pairs to enhance prosodic information in latent representations. In the second stage, we build a multi-modal prosody annotator comprising pretrained encoders, a text-speech fusing scheme, and a sequence classifier. Experiments on English prosodic boundaries demonstrate that our method achieves state-of-the-art (SOTA) performance, with F1 scores of 0.72 and 0.93 for Prosodic Word and Prosodic Phrase boundaries respectively, while bearing remarkable robustness to data…
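
The first stage described in the abstract pairs a speech-side embedding with a text-side embedding and trains them contrastively. The abstract does not state the exact loss, so the sketch below assumes a CLIP/CLAP-style symmetric cross-entropy objective over a batch of matched pairs. The function names, the temperature value, and the toy vectors are all illustrative assumptions, and pure Python is used only to keep the example dependency-free.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: speech_embs[i] and text_embs[i] are a matched pair."""
    s = [l2_normalize(v) for v in speech_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(s)
    # Cosine similarity between every speech/text combination, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(s[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Speech-to-text direction: each row should pick its own column.
    loss_s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: each column should pick its own row.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    loss_t2s = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)


# Toy 2-D embeddings: matched pairs should score a lower loss than swapped pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Under this assumed objective, minimising the loss pulls each matched speech and text embedding together and pushes mismatched combinations in the batch apart, which is the general mechanism by which contrastive pretraining can shape the latent representations.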

Code & Models

Repositories

jzmzhong/Automatic-Prosody-Annotator-with-SSWP-CLAP
PyTorch · Official

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and dialogue systems · Natural Language Processing Techniques