PSST! Prosodic Speech Segmentation with Transformers

Nathan Roll; Calbert Graham; Simon Todd

arXiv:2302.01984·cs.CL·February 28, 2025


TL;DR

This paper fine-tunes a pretrained speech-to-text model to accurately segment prosodic units in speech, achieving high accuracy with minimal data and computational resources, and introduces filtering techniques to enhance performance.

Contribution

The paper repurposes Whisper, a pretrained speech-to-text model, for prosodic segmentation. It demonstrates high accuracy and robustness without large labeled datasets or extensive compute, and explores input filtering to improve out-of-sample performance.

Findings

01. Achieves 95.8% accuracy in prosodic segmentation.

02. Low-pass filtering at 3.2 kHz improves out-of-sample and out-of-distribution performance.

03. Outperforms previous methods while using less data and compute.
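
The low-pass filtering in the second finding can be illustrated with a minimal sketch. This page does not say which filter design the authors used, so the windowed-sinc FIR filter below, the `low_pass` helper, the tap count, and the synthetic test tones are all assumptions made for illustration. Only the 3.2 kHz cutoff comes from the paper.

```python
import numpy as np

def low_pass(signal, sample_rate, cutoff_hz=3200.0, num_taps=101):
    """Windowed-sinc FIR low-pass filter.

    Hypothetical helper: the filter design and num_taps are assumptions,
    only the 3.2 kHz default cutoff is taken from the paper.
    """
    if not 0 < cutoff_hz < sample_rate / 2:
        raise ValueError("cutoff must lie between 0 and the Nyquist frequency")
    fc = cutoff_hz / sample_rate                  # cutoff in cycles per sample
    n = np.arange(num_taps) - (num_taps - 1) / 2  # tap indices centred on zero
    taps = 2 * fc * np.sinc(2 * fc * n) * np.hamming(num_taps)
    taps /= taps.sum()                            # unity gain at 0 Hz
    return np.convolve(signal, taps, mode="same")

# Synthetic check at 16 kHz (the sampling rate Whisper expects):
# a 1 kHz tone lies below the cutoff, a 6 kHz tone lies above it.
sr = 16_000
t = np.arange(sr) / sr
low_tone = np.sin(2 * np.pi * 1000 * t)
high_tone = np.sin(2 * np.pi * 6000 * t)
filtered_low = low_pass(low_tone, sr)
filtered_high = low_pass(high_tone, sr)
```

In this sketch the 1 kHz tone passes nearly unchanged while the 6 kHz tone is strongly attenuated. Real recordings would be filtered the same way before being passed to the model.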

Abstract

Self-attention mechanisms have enabled transformers to achieve superhuman-level performance on many speech-to-text (STT) tasks, yet the challenge of automatic prosodic segmentation has remained unsolved. In this paper we fine-tune Whisper, a pretrained STT model, to annotate intonation unit (IU) boundaries by repurposing low-frequency tokens. Our approach achieves an accuracy of 95.8%, outperforming previous methods without the need for large-scale labeled data or enterprise-grade compute resources. We also diminish input signals by applying a series of filters, finding that low-pass filters at a 3.2 kHz level improve segmentation performance in out-of-sample and out-of-distribution contexts. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
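
To make the idea of marking IU boundaries with a repurposed token concrete, here is a minimal post-processing sketch. It assumes the model emits a transcript in which a dedicated marker separates intonation units. The marker string `<IU>`, both helper functions, and the example sentence are hypothetical, since the abstract does not say which token is repurposed.

```python
BOUNDARY = "<IU>"  # hypothetical marker; the token actually repurposed may differ

def split_intonation_units(transcript, boundary=BOUNDARY):
    """Split a boundary-annotated transcript into intonation units."""
    units = [unit.strip() for unit in transcript.split(boundary)]
    return [unit for unit in units if unit]  # drop empty segments

def boundary_labels(transcript, boundary=BOUNDARY):
    """Return (words, labels), where labels[i] is 1 if an IU ends after word i.

    The end of the transcript is treated as a boundary (an assumption).
    """
    words, labels = [], []
    for unit in split_intonation_units(transcript, boundary):
        tokens = unit.split()
        words.extend(tokens)
        labels.extend([0] * (len(tokens) - 1) + [1])
    return words, labels

example = "so I went home <IU> and then <IU> it rained"
units = split_intonation_units(example)
words, labels = boundary_labels(example)
```

Per-word labels of this kind are one way predicted boundaries could be compared against a reference annotation. How the reported 95.8% accuracy is computed is not stated here.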


Code & Models

Repositories

nathan-roll1/psst
Official


Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Phonetics and Phonology Research