Phoneme Segmentation Using Self-Supervised Speech Models
Luke Strgar, David Harwath

TL;DR
This paper demonstrates that representations from self-supervised speech models transfer effectively to phoneme segmentation, outperforming previous methods in both supervised and unsupervised settings on the TIMIT and Buckeye corpora.
Contribution
The paper introduces a transformer-style encoder, extended with convolutions that operate on features learned in self-supervised pre-training, for phoneme segmentation. It also disambiguates the definition and implementation of widely used evaluation metrics.
Findings
The model surpasses the previous state of the art in both supervised and unsupervised settings on TIMIT and Buckeye
Effective use of self-supervised pre-trained features
Clarification of evaluation metric definitions
Abstract
We apply transfer learning to the task of phoneme segmentation and demonstrate the utility of representations learned in self-supervised pre-training for the task. Our model extends transformer-style encoders with strategically placed convolutions that manipulate features learned in pre-training. Using the TIMIT and Buckeye corpora we train and test the model in the supervised and unsupervised settings. The latter case is accomplished by furnishing a noisy label-set with the predictions of a separate model, it having been trained in an unsupervised fashion. Results indicate our model eclipses previous state-of-the-art performance in both settings and on both datasets. Finally, following observations during published code review and attempts to reproduce past segmentation results, we find a need to disambiguate the definition and implementation of widely-used evaluation metrics. We…
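As a rough illustration of why metric definitions matter here, the sketch below scores predicted boundaries against reference boundaries. The abstract does not name the metrics, so the choices in this sketch are assumptions: precision, recall, F1, and R-value as commonly reported in segmentation work, a 20 ms tolerance, and a one-to-one matching rule in which each reference boundary can be credited to at most one prediction. The function names `match_boundaries` and `boundary_scores` are hypothetical and do not come from the authors' code.

```python
import math


def match_boundaries(ref, hyp, tol=0.02):
    """Count one-to-one matches between boundary times given in seconds.

    Assumption: a predicted boundary is a hit if it lies within `tol`
    of a reference boundary that has not already been matched.
    """
    ref, hyp = sorted(ref), sorted(hyp)
    i = j = hits = 0
    while i < len(ref) and j < len(hyp):
        if abs(ref[i] - hyp[j]) <= tol:
            hits += 1          # consume both boundaries: no double counting
            i += 1
            j += 1
        elif hyp[j] < ref[i]:
            j += 1             # prediction too early to match anything left
        else:
            i += 1             # reference boundary was missed
    return hits


def boundary_scores(ref, hyp, tol=0.02):
    """Return precision, recall, F1, and R-value for one utterance."""
    if not ref:
        raise ValueError("reference boundary list must be non-empty")
    hits = match_boundaries(ref, hyp, tol)
    precision = hits / len(hyp) if hyp else 0.0
    recall = hits / len(ref)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    # Over-segmentation: equals recall / precision - 1 whenever hits > 0.
    over_seg = len(hyp) / len(ref) - 1
    r1 = math.sqrt((1 - recall) ** 2 + over_seg ** 2)
    r2 = (-over_seg + recall - 1) / math.sqrt(2)
    r_value = 1 - (abs(r1) + abs(r2)) / 2
    return {"precision": precision, "recall": recall,
            "f1": f1, "r_value": r_value}


if __name__ == "__main__":
    reference = [0.10, 0.25, 0.40, 0.60]
    predicted = [0.11, 0.26, 0.50]
    print(boundary_scores(reference, predicted))
```

A different convention, such as allowing one reference boundary to validate several nearby predictions, would give higher precision on the same inputs, which is the kind of ambiguity that makes reported numbers hard to compare.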
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
Methods: Test
