A Simple HMM with Self-Supervised Representations for Phone Segmentation
Gene-Ping Yang, Hao Tang

TL;DR
This paper demonstrates that peak detection on Mel spectrograms is a strong baseline for phonetic segmentation, and introduces a simple HMM using self-supervised representations and boundary features that outperforms previous methods.
Contribution
The paper shows that a simple peak detection baseline can outperform many self-supervised methods and proposes a straightforward HMM approach leveraging self-supervised features for improved segmentation.
Findings
Peak detection on Mel spectrograms is a strong baseline.
The proposed HMM, which uses self-supervised representations and features at the boundaries, consistently improves over previous approaches.
The HMM's generalized formulation allows versatile design adaptations.
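To make the peak-detection baseline concrete, here is a minimal sketch and not the authors' implementation. It assumes each frame is a feature vector (for example, one column of a Mel spectrogram computed elsewhere), scores each frame transition by the Euclidean distance between adjacent frames, and hypothesizes a boundary at every local maximum above a threshold. The function names, the distance measure, and the threshold are illustrative assumptions; the summary does not specify them.

```python
import math

def frame_distances(frames):
    """Euclidean distance between each pair of adjacent frames."""
    return [math.dist(a, b) for a, b in zip(frames, frames[1:])]

def detect_peaks(scores, threshold=0.0):
    """Indices i where scores[i] is a strict local maximum above threshold."""
    peaks = []
    for i in range(1, len(scores) - 1):
        is_local_max = scores[i] > scores[i - 1] and scores[i] > scores[i + 1]
        if is_local_max and scores[i] > threshold:
            peaks.append(i)
    return peaks

def segment_boundaries(frames, threshold=0.0):
    """Hypothesized boundaries, given as the index of the first frame of each new segment.

    A peak at transition i (between frame i and frame i + 1) yields boundary i + 1.
    The distance measure and threshold are assumptions made for illustration.
    """
    scores = frame_distances(frames)
    return [i + 1 for i in detect_peaks(scores, threshold)]

# Toy 2-dimensional "frames" with two abrupt changes.
toy_frames = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [0.0, 0.0], [0.1, 0.0],
]
toy_boundaries = segment_boundaries(toy_frames, threshold=1.0)
```

On the toy input, the two large jumps between frames produce the only peaks above the threshold, so boundaries are placed at the frames where the values change abruptly.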
Abstract
Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, with the hope that the improvement can transfer to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.
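As a rough illustration of how an HMM can turn frame features into segments, the sketch below runs generic Viterbi decoding and reads boundaries off the points where the best state sequence changes state. It is not the paper's formulation. The emission score (negative squared distance to a per-state mean), the fixed `switch_penalty`, and the optional per-frame `boundary_scores` that reward switching are all hypothetical choices standing in for the self-supervised representations and boundary features the abstract mentions.

```python
def viterbi_segment(frames, means, switch_penalty=1.0, boundary_scores=None):
    """Generic Viterbi decoding over K states; returns (state path, boundaries).

    Assumptions for illustration only: emissions are scored by negative
    squared distance to a state mean, staying in a state is free, and
    switching at frame t costs switch_penalty minus boundary_scores[t].
    """
    T, K = len(frames), len(means)
    if boundary_scores is None:
        boundary_scores = [0.0] * T

    def emit(t, k):
        return -sum((x - m) ** 2 for x, m in zip(frames[t], means[k]))

    def trans(j, k, t):
        return 0.0 if j == k else boundary_scores[t] - switch_penalty

    score = [emit(0, k) for k in range(K)]
    back = []  # back[t - 1][k] is the best predecessor of state k at frame t
    for t in range(1, T):
        new_score, pointers = [], []
        for k in range(K):
            best_j = max(range(K), key=lambda j: score[j] + trans(j, k, t))
            new_score.append(score[best_j] + trans(best_j, k, t) + emit(t, k))
            pointers.append(best_j)
        score = new_score
        back.append(pointers)

    # Backtrace from the best final state.
    state = max(range(K), key=lambda k: score[k])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    boundaries = [t for t in range(1, T) if path[t] != path[t - 1]]
    return path, boundaries

# Toy 1-dimensional frames drawn around two state means.
toy_frames = [[0.0], [0.1], [0.0], [5.0], [5.1], [5.0]]
toy_means = [[0.0], [5.0]]
toy_path, toy_boundaries = viterbi_segment(toy_frames, toy_means)
```

The switch penalty discourages spurious state changes, while a positive boundary score at a frame makes a change there cheaper; that is one simple way to combine frame-level and boundary-level evidence in a single decoding pass.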
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques
