Prosody Labeling with Phoneme-BERT and Speech Foundation Models

Tomoki Koriyama

arXiv:2507.03912 · eess.AS · July 8, 2025

TL;DR

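This paper introduces a prosody labeling model that concatenates acoustic features from a speech foundation model (a self-supervised model or a Whisper encoder) with linguistic features from a phoneme-input foundation model (such as PnG BERT or PL-BERT). On Japanese data, the combination predicts phoneme-level prosodic labels more accurately than either input alone.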

Contribution

It presents a new approach that integrates speech and linguistic foundation models for more accurate automatic prosody annotation.

Findings

01 Achieved 89.8% accuracy in accent labels

02 Achieved 93.2% accuracy in pitch accents

03 Achieved 94.3% accuracy in break indices

Abstract

This paper proposes a model for automatic prosodic label annotation, where the predicted labels can be used for training a prosody-controllable text-to-speech model. The proposed model utilizes not only rich acoustic features extracted by a self-supervised-learning (SSL)-based model or a Whisper encoder, but also linguistic features obtained from phoneme-input pretrained linguistic foundation models such as PnG BERT and PL-BERT. The concatenation of acoustic and linguistic features is used to predict phoneme-level prosodic labels. In the experimental evaluation on Japanese prosodic labels, including pitch accents and phrase break indices, it was observed that the combination of both speech and linguistic foundation models enhanced the prediction accuracy compared to using either a speech or linguistic input alone. Specifically, we achieved 89.8% prediction accuracy in accent labels,…
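The concatenation step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes frame-level acoustic features are mean-pooled over known phoneme boundaries before being joined with phoneme-level linguistic features, and it uses a single linear layer as the classifier. The function names, the pooling choice, the dimensions, and the random toy data are all hypothetical.

```python
import random

def mean_pool(frames, start, end):
    """Average the frame vectors frames[start:end] into one vector."""
    span = frames[start:end]
    return [sum(col) / len(span) for col in zip(*span)]

def fuse_features(acoustic_frames, boundaries, linguistic_feats):
    """Build one fused vector per phoneme.

    Assumption: each phoneme has a (start, end) frame span, end exclusive.
    The pooled acoustic vector is concatenated with that phoneme's
    linguistic vector.
    """
    fused = []
    for (start, end), ling in zip(boundaries, linguistic_feats):
        fused.append(mean_pool(acoustic_frames, start, end) + list(ling))
    return fused

def classify(fused, weights, bias):
    """Linear classifier (a stand-in for the real predictor): argmax of W x + b."""
    labels = []
    for vec in fused:
        logits = [sum(w * x for w, x in zip(row, vec)) + b
                  for row, b in zip(weights, bias)]
        labels.append(max(range(len(logits)), key=logits.__getitem__))
    return labels

# Toy data standing in for real model outputs (all sizes are hypothetical).
random.seed(0)
T, D_A, D_L, C = 20, 8, 6, 3              # frames, acoustic dim, linguistic dim, classes
boundaries = [(0, 5), (5, 12), (12, 20)]  # three phonemes
acoustic_frames = [[random.gauss(0, 1) for _ in range(D_A)] for _ in range(T)]
linguistic_feats = [[random.gauss(0, 1) for _ in range(D_L)] for _ in boundaries]
weights = [[random.gauss(0, 1) for _ in range(D_A + D_L)] for _ in range(C)]
bias = [0.0] * C

fused = fuse_features(acoustic_frames, boundaries, linguistic_feats)
labels = classify(fused, weights, bias)
print(len(fused), len(fused[0]), len(labels))
```

In this sketch each phoneme ends up with a vector of length `D_A + D_L`, and the classifier emits one label index per phoneme, which mirrors the phoneme-level prediction the abstract describes.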
