Pitch Accent Detection improves Pretrained Automatic Speech Recognition

David Sasu; Natalie Schluter

arXiv:2508.04814·cs.CL·August 8, 2025

TL;DR

This paper shows that adding a pitch accent detection module to ASR systems built on semi-supervised speech representations improves recognition accuracy and helps the model retain prosodic cues, with the WER gains reported under limited-resource fine-tuning.

Contribution

The authors introduce a joint ASR and pitch accent detection model that improves ASR performance and closes the gap in F1-score for pitch accent detection by 41%, advancing prosody-aware speech recognition.

Findings

01

Pitch accent detection closes the gap in F1-score by 41% over the prior state of the art

02

Joint training decreases word error rate (WER) by 28.3% on LibriSpeech under limited-resource fine-tuning

03

Joint training enhances prosodic cue retention in ASR

Abstract

We show that the performance of Automatic Speech Recognition (ASR) systems that use semi-supervised speech representations can be boosted by a complementary pitch accent detection module, by introducing a joint ASR and pitch accent detection model. The pitch accent detection component of our model achieves a significant improvement over the state of the art for the task, closing the gap in F1-score by 41%. Additionally, joint training decreases WER by 28.3% on LibriSpeech under limited-resource fine-tuning. With these results, we show the importance of extending pretrained speech models to retain or re-learn important prosodic cues such as pitch accent.
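The two headline percentages in the abstract are ratios, and it can help to see the arithmetic behind them. The sketch below is only an illustration under stated assumptions: it reads the WER figure as a relative reduction and reads "closing the gap" as the share of the remaining distance to a perfect F1-score of 1.0. The abstract does not spell out either definition, the function names are hypothetical, and the input numbers are invented for the example, not taken from the paper.

```python
def relative_reduction(baseline: float, new: float) -> float:
    """Fraction by which `new` is lower than `baseline` (e.g. for WER)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - new) / baseline


def gap_closed(previous: float, new: float, ceiling: float = 1.0) -> float:
    """Fraction of the distance from `previous` to `ceiling` recovered by `new`.

    Assumption: the gap is measured against a perfect score (`ceiling`).
    """
    gap = ceiling - previous
    if gap <= 0:
        raise ValueError("previous score must be below the ceiling")
    return (new - previous) / gap


# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
# They are NOT results reported in the paper.
wer_baseline, wer_joint = 10.0, 7.17
f1_previous, f1_new = 0.80, 0.882

print(f"relative WER reduction: {relative_reduction(wer_baseline, wer_joint):.1%}")
print(f"F1 gap closed: {gap_closed(f1_previous, f1_new):.1%}")
```

Under this reading, a 41% gap closure can correspond to a much smaller absolute change in F1-score, which is why it should not be equated with a 41% improvement in F1-score itself.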
