Prosody Labeling with Phoneme-BERT and Speech Foundation Models
Tomoki Koriyama

TL;DR
This paper introduces a prosody labeling model that combines acoustic features from speech foundation models (a self-supervised-learning-based model or a Whisper encoder) with linguistic features from phoneme-input foundation models such as PnG BERT and PL-BERT, improving prosodic label prediction accuracy for Japanese.
Contribution
It presents an approach that concatenates features from speech and linguistic foundation models to predict phoneme-level prosodic labels, giving more accurate automatic prosody annotation than either input alone.
Findings
Achieved 89.8% accuracy in accent labels
Achieved 93.2% accuracy in pitch accents
Achieved 94.3% accuracy in break indices
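For reference, a minimal sketch of how a label accuracy like the figures above could be computed, assuming accuracy means the fraction of phonemes whose predicted label matches the reference label. The summary does not define the metric, and the function name `label_accuracy` and the toy label sequences are hypothetical, not taken from the paper.

```python
from typing import Sequence


def label_accuracy(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of positions where the predicted label equals the reference.

    Assumption: one label per phoneme, compared position by position.
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)


# Toy data for illustration only (not results from the paper).
toy_reference = [0, 1, 1, 2, 0]
toy_predicted = [0, 1, 2, 2, 0]
toy_accuracy = label_accuracy(toy_predicted, toy_reference)  # 4 of 5 match
```

Under this reading, each label type (accent labels, pitch accents, break indices) would be scored separately with its own predicted and reference sequences.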
Abstract
This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used for training a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation models such as PnG BERT and PL-BERT. The concatenation of acoustic and linguistic features is used to predict phoneme-level prosodic labels. In the experimental evaluation on Japanese prosodic labels, including pitch accents and phrase break indices, it was observed that the combination of both speech and linguistic foundation models enhanced the prediction accuracy compared to using either a speech or linguistic input alone. Specifically, we achieved 89.8% prediction accuracy in accent labels,…
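The feature concatenation described in the abstract can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes frame-level acoustic features are mean-pooled over each phoneme's frame span (the abstract does not say how frames are aligned to phonemes), and it stands in a plain linear classifier for whatever prediction layers the paper uses. All names (`mean_pool`, `fuse_features`, `predict_labels`) and the toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def mean_pool(frames: Sequence[Vector], start: int, end: int) -> Vector:
    """Average frame-level vectors over the half-open frame span [start, end)."""
    span = frames[start:end]
    if not span:
        raise ValueError(f"empty frame span [{start}, {end})")
    dim = len(span[0])
    return [sum(frame[d] for frame in span) / len(span) for d in range(dim)]


def fuse_features(
    acoustic_frames: Sequence[Vector],
    phoneme_spans: Sequence[Tuple[int, int]],
    linguistic_feats: Sequence[Vector],
) -> List[Vector]:
    """Concatenate pooled acoustic and linguistic features, one vector per phoneme."""
    if len(phoneme_spans) != len(linguistic_feats):
        raise ValueError("need one frame span and one linguistic vector per phoneme")
    return [
        mean_pool(acoustic_frames, start, end) + list(ling)
        for (start, end), ling in zip(phoneme_spans, linguistic_feats)
    ]


def predict_labels(
    fused: Sequence[Vector],
    weights: Sequence[Vector],
    biases: Sequence[float],
) -> List[int]:
    """Hypothetical linear classifier: pick the highest-scoring label per phoneme."""
    labels = []
    for x in fused:
        scores = [
            sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
            for w, b in zip(weights, biases)
        ]
        labels.append(max(range(len(scores)), key=scores.__getitem__))
    return labels


# Toy inputs: 5 acoustic frames (dim 2), 2 phonemes, linguistic features (dim 1).
acoustic_frames = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0], [0.0, 6.0]]
phoneme_spans = [(0, 2), (2, 5)]       # assumed frame alignment per phoneme
linguistic_feats = [[0.5], [-0.5]]     # stand-in for PnG BERT / PL-BERT outputs

fused = fuse_features(acoustic_frames, phoneme_spans, linguistic_feats)
predicted = predict_labels(
    fused,
    weights=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],  # two hypothetical label classes
    biases=[0.0, 0.0],
)
```

In a real system the acoustic frames would come from an SSL-based model or a Whisper encoder and the linguistic vectors from a phoneme-input model, with separate prediction heads for each label type; the sketch only shows the shape of the fusion step.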
