PSST! Prosodic Speech Segmentation with Transformers
Nathan Roll, Calbert Graham, Simon Todd

TL;DR
This paper fine-tunes a pretrained speech-to-text model to accurately segment prosodic units in speech, achieving high accuracy with minimal data and computational resources, and introduces filtering techniques to enhance performance.
Contribution
It repurposes Whisper for prosodic segmentation, demonstrating high accuracy and robustness without large labeled datasets or extensive compute, and explores filtering methods to improve out-of-sample performance.
Findings
Achieves 95.8% accuracy in prosodic segmentation.
Low-pass filtering at 3.2 kHz improves out-of-sample and out-of-distribution performance.
Outperforms previous methods with less data and compute.
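To make the filtering finding concrete, here is a minimal sketch of a 3.2 kHz low-pass filter in plain Python. It is an illustration only, not the authors' preprocessing code: the windowed-sinc (Hamming) FIR design, the 101-tap length, the 16 kHz sample rate (the rate Whisper expects as input), and the function names `lowpass_kernel`, `lowpass`, `tone`, and `rms` are all assumptions made for this example.

```python
import math


def lowpass_kernel(cutoff_hz, sample_rate, num_taps=101):
    """Windowed-sinc FIR low-pass kernel (Hamming window), unit gain at DC.

    Assumed design for illustration; the paper's filter type is not stated here.
    """
    fc = cutoff_hz / sample_rate  # cutoff in cycles per sample
    mid = (num_taps - 1) / 2
    kernel = []
    for n in range(num_taps):
        x = n - mid
        # Ideal low-pass impulse response (sinc), with the x == 0 limit handled.
        ideal = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        kernel.append(ideal * window)
    total = sum(kernel)
    return [k / total for k in kernel]


def lowpass(signal, cutoff_hz=3200.0, sample_rate=16000, num_taps=101):
    """Filter a list of samples; output has the same length (zero-padded edges)."""
    kernel = lowpass_kernel(cutoff_hz, sample_rate, num_taps)
    half = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out


def tone(freq_hz, sample_rate=16000, n=800):
    """Unit-amplitude sine wave, used here only as a test signal."""
    return [math.sin(2 * math.pi * freq_hz * t / sample_rate) for t in range(n)]


def rms(xs):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


if __name__ == "__main__":
    kept = lowpass(tone(500.0))      # well below the 3.2 kHz cutoff
    removed = lowpass(tone(6000.0))  # well above the cutoff
    # Compare levels away from the edges, where zero padding distorts the output.
    print(rms(kept[100:700]), rms(removed[100:700]))
```

A 500 Hz tone passes through nearly unchanged while a 6 kHz tone is strongly attenuated, which is the sense in which such a filter "diminishes" the input signal. In practice a library routine would replace the explicit loops.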
Abstract
Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper we fine-tune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also diminish input signals by applying a series of filters, finding that low-pass filters at a 3.2 kHz level improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
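The idea of marking IU boundaries with a repurposed token can be sketched as plain string handling. The sketch below is hypothetical throughout: the marker string `<|iu|>` stands in for whichever rarely used token the authors chose (this page does not name it), and `boundary_accuracy` is one simple word-level agreement measure, not necessarily the metric behind the reported 95.8%.

```python
BOUNDARY = "<|iu|>"  # hypothetical marker; the paper's actual token is not given here


def encode_target(units):
    """Join intonation units into one target string, marking the end of each unit."""
    return "".join(u.strip() + " " + BOUNDARY + " " for u in units).strip()


def decode_units(text):
    """Split a marked transcript back into its intonation units."""
    return [seg.strip() for seg in text.split(BOUNDARY) if seg.strip()]


def boundary_flags(text):
    """Return one flag per word: True if a boundary marker follows that word."""
    flags = []
    for tok in text.split():
        if tok == BOUNDARY:
            if flags:
                flags[-1] = True
        else:
            flags.append(False)
    return flags


def boundary_accuracy(reference, hypothesis):
    """Fraction of word positions whose boundary flag matches the reference.

    Assumes both transcripts contain the same number of words.
    """
    ref, hyp = boundary_flags(reference), boundary_flags(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("transcripts must have the same number of words")
    return sum(r == h for r, h in zip(ref, hyp)) / len(ref)


if __name__ == "__main__":
    target = encode_target(["so I went", "to the store"])
    print(target)
    print(decode_units(target))
    print(boundary_accuracy(target, "so I <|iu|> went to the store <|iu|>"))
```

Under this framing, fine-tuning pairs each audio clip with a marked transcript like `target`, and the model's output is decoded back into units with `decode_units`.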
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research
