Blind phoneme segmentation with temporal prediction errors
Paul Michel, Okko Räsänen, Roland Thiollière, Emmanuel Dupoux

TL;DR
This paper introduces an unsupervised method for phoneme segmentation that uses the prediction errors of sequence models to locate phoneme boundaries, and reports improvements over similar methods on the TIMIT dataset.
Contribution
It presents a novel approach using error profiles from sequence prediction models for unsupervised phoneme segmentation, which improves over similar existing methods.
Findings
Effective boundary detection via local maxima in prediction error
Improved segmentation accuracy on TIMIT dataset
Unsupervised approach reduces need for labeled data
Abstract
Phonemic segmentation of speech is a critical step in speech recognition systems. We propose a novel unsupervised algorithm based on sequence prediction models such as Markov chains and recurrent neural networks. Our approach consists of analyzing the error profile of a model trained to predict speech features frame by frame. Specifically, we try to learn the dynamics of speech in the MFCC space and hypothesize boundaries from local maxima in the prediction error. We evaluate our system on the TIMIT dataset, with improvements over similar methods.
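The idea in the abstract (predict each frame from the previous one, then place boundaries at local maxima of the prediction error) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a least-squares linear next-frame predictor as a stand-in for the Markov chain and recurrent models named above, synthetic 13-dimensional features in place of MFCCs extracted from TIMIT, and a hypothetical threshold of mean plus two standard deviations of the error. The function names are illustrative only.

```python
import numpy as np

def _with_bias(frames):
    # Append a constant column so the linear predictor can learn an offset.
    return np.hstack([frames, np.ones((len(frames), 1))])

def fit_linear_predictor(feats):
    # Assumption: a linear map x[t] -> x[t+1] stands in for the paper's
    # sequence models (Markov chain / RNN), which are not reproduced here.
    W, *_ = np.linalg.lstsq(_with_bias(feats[:-1]), feats[1:], rcond=None)
    return W

def prediction_error(feats, W):
    # Squared error of predicting frame t from frame t-1.
    pred = _with_bias(feats[:-1]) @ W
    err = np.sum((pred - feats[1:]) ** 2, axis=1)
    # Frame 0 has no predecessor, so pad with 0 to align err[t] with frame t.
    return np.concatenate([[0.0], err])

def pick_boundaries(err, threshold):
    # A frame is a boundary candidate if its error is a local maximum
    # and exceeds the threshold (the threshold rule is an assumption).
    mid = err[1:-1]
    is_peak = (mid > err[:-2]) & (mid >= err[2:]) & (mid > threshold)
    return np.flatnonzero(is_peak) + 1

# Synthetic stand-in for MFCC frames: three 40-frame "phonemes" with
# different mean feature values and a little noise.
rng = np.random.default_rng(0)
segment_means = [0.0, 5.0, 0.0]
feats = np.vstack(
    [m + 0.1 * rng.standard_normal((40, 13)) for m in segment_means]
)

W = fit_linear_predictor(feats)
err = prediction_error(feats, W)
boundaries = pick_boundaries(err, threshold=err.mean() + 2 * err.std())
print(boundaries.tolist())
```

In this toy input the segment transitions sit at frames 40 and 80 by construction, so the error profile should spike where the predictor is surprised by a change in the feature dynamics. On real speech the features would be MFCC frames and the predictor would be trained on a corpus rather than fitted to the utterance being segmented.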
