A Simple HMM with Self-Supervised Representations for Phone Segmentation

Gene-Ping Yang; Hao Tang

arXiv:2409.09646 · cs.CL · September 23, 2024

Open Access

TL;DR

This paper demonstrates that peak detection on Mel spectrograms is a strong baseline for phonetic segmentation, and introduces a simple HMM using self-supervised representations and boundary features that outperforms previous methods.

Contribution

The paper shows that a simple peak detection baseline can outperform many self-supervised methods and proposes a straightforward HMM approach leveraging self-supervised features for improved segmentation.
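The peak detection idea can be illustrated with a minimal sketch. It assumes the input is a matrix of spectrogram frames (for example log-Mel features) computed elsewhere. The Euclidean distance, the normalization, the `min_height` threshold, and the function name `detect_boundaries` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def detect_boundaries(features, min_height=0.3):
    """Hypothesize boundaries at peaks of the frame-to-frame distance.

    features: (T, D) array of frames, e.g. log-Mel spectrogram frames.
    Returns frame indices t where a boundary is placed between t-1 and t.
    The distance measure and threshold are assumptions for illustration.
    """
    features = np.asarray(features, dtype=float)
    # d[i] is the Euclidean distance between frame i and frame i + 1.
    d = np.linalg.norm(np.diff(features, axis=0), axis=1)
    if d.size == 0 or d.max() == 0:
        return np.array([], dtype=int)
    d = d / d.max()  # scale to [0, 1] so the threshold is relative
    # A peak is strictly above its left neighbor and not below its right one.
    padded = np.concatenate(([-np.inf], d, [-np.inf]))
    is_peak = (
        (padded[1:-1] > padded[:-2])
        & (padded[1:-1] >= padded[2:])
        & (d >= min_height)
    )
    # Shift by one so the index names the first frame of the new segment.
    return np.flatnonzero(is_peak) + 1

# Toy input: three constant segments of five frames each.
toy = np.concatenate([
    np.zeros((5, 4)),
    np.ones((5, 4)),
    3.0 * np.ones((5, 4)),
])
boundaries = detect_boundaries(toy)
```

On real speech, the distance curve is usually smoothed and the threshold tuned on held-out data; both choices are left out here to keep the sketch short.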

Findings

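01 Peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches.

02 The proposed HMM, which uses self-supervised representations and boundary features, improves segmentation results over previous approaches.

03 The HMM's generalized formulation allows versatile design adaptations.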

Abstract

Despite recent advances in self-supervised representations, unsupervised phonetic segmentation remains challenging. Most approaches focus on improving phonetic representations with self-supervised learning, in the hope that the improvement transfers to phonetic segmentation. In this paper, contrary to recent approaches, we show that peak detection on Mel spectrograms is a strong baseline, better than many self-supervised approaches. Based on this finding, we propose a simple hidden Markov model that uses self-supervised representations and features at the boundaries for phone segmentation. Our results demonstrate consistent improvements over previous approaches, with a generalized formulation allowing versatile design adaptations.
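To make the general idea of HMM-based segmentation concrete, the following is a generic Viterbi sketch, not the paper's model. It assumes each hidden state is a cluster centroid over frame representations, scores emissions by negative squared distance, and lets a per-frame boundary score offset a fixed cost for switching states. The names `viterbi_segment`, `centroids`, `boundary_score`, and `switch_penalty` are hypothetical.

```python
import numpy as np

def viterbi_segment(frames, centroids, boundary_score, switch_penalty=1.0):
    """Decode a state path and read boundaries off the state changes.

    frames: (T, D) frame representations (assumed precomputed).
    centroids: (K, D) one centroid per hidden state (an assumption).
    boundary_score: (T,) bonus for switching states when entering frame t.
    """
    frames = np.asarray(frames, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    boundary_score = np.asarray(boundary_score, dtype=float)
    T, K = len(frames), len(centroids)
    # emission[t, k] = negative squared distance from frame t to centroid k.
    diff = frames[:, None, :] - centroids[None, :, :]
    emission = -(diff ** 2).sum(axis=2)
    score = emission[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # Staying costs nothing; switching pays the penalty minus the bonus.
        trans = np.full((K, K), boundary_score[t] - switch_penalty)
        np.fill_diagonal(trans, 0.0)
        cand = score[:, None] + trans  # cand[i, j]: previous i, current j
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emission[t]
    # Backtrace from the best final state.
    path = np.zeros(T, dtype=int)
    path[-1] = int(score.argmax())
    for t in range(T - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    boundaries = np.flatnonzero(path[1:] != path[:-1]) + 1
    return path, boundaries

# Toy input: one-dimensional frames that jump from 0 to 1 at frame 3.
toy_frames = np.array([[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]])
toy_centroids = np.array([[0.0], [1.0]])
path, boundaries = viterbi_segment(
    toy_frames, toy_centroids, boundary_score=np.zeros(6), switch_penalty=0.5
)
```

A larger `switch_penalty` discourages short segments, while a high `boundary_score` at a frame makes a state change there cheaper; how the paper actually combines representations and boundary features is not specified in this summary.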

Taxonomy

Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Advanced Data Compression Techniques