Learning Dependencies of Discrete Speech Representations with Neural Hidden Markov Models

Sung-Lin Yeh; Hao Tang

arXiv:2210.16659·cs.CL·November 1, 2022

Open Access

TL;DR

This paper uses neural hidden Markov models to learn discrete speech representations. Modeling Markovian dependencies among frame-level latent variables makes phonetic information more accessible and improves phonetic segmentation and the cluster purity of phones.

Contribution

It proposes a neural hidden Markov model framework that models Markovian dependencies among latent variables in speech, enhancing phonetic representation learning.

Findings

01. Dependencies improve phonetic segmentation accuracy

02. Enhanced cluster purity of phones

03. Better access to phonetic information
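To make the second finding concrete, here is a minimal sketch of a purity score in Python. It assumes the common clustering-literature definition (each cluster votes for its most frequent phone label, and purity is the fraction of frames that agree with their cluster's vote); the paper's exact metric may be defined differently. The function name `cluster_purity` and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_ids, phone_labels):
    """Fraction of frames whose phone matches the majority phone of their cluster.

    Assumed definition (standard clustering purity), not taken from the paper.
    """
    if len(cluster_ids) != len(phone_labels):
        raise ValueError("cluster_ids and phone_labels must have the same length")
    if not cluster_ids:
        raise ValueError("at least one frame is required")

    # Count how often each phone label occurs inside each cluster.
    counts_by_cluster = defaultdict(Counter)
    for cluster, phone in zip(cluster_ids, phone_labels):
        counts_by_cluster[cluster][phone] += 1

    # Each cluster contributes the size of its largest phone group.
    majority_total = sum(max(c.values()) for c in counts_by_cluster.values())
    return majority_total / len(cluster_ids)


# Toy frame-level data (hypothetical): six frames, three discrete codes.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_phones = ["aa", "aa", "iy", "iy", "iy", "s"]
toy_purity = cluster_purity(toy_clusters, toy_phones)
```

In the toy data, cluster 0 contains two "aa" frames and one "iy" frame, so five of the six frames agree with their cluster's majority phone.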

Abstract

While discrete latent variable models have had great success in self-supervised learning, most models assume that frames are independent. Due to the segmental nature of phonemes in speech perception, modeling dependencies among latent variables at the frame level can potentially improve the learned representations on phonetic-related tasks. In this work, we assume Markovian dependencies among latent variables, and propose to learn speech representations with neural hidden Markov models. Our general framework allows us to compare to self-supervised models that assume independence, while keeping the number of parameters fixed. The added dependencies improve the accessibility of phonetic information, phonetic segmentation, and the cluster purity of phones, showcasing the benefit of the assumed dependencies.
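The contrast between the independence assumption and the Markovian assumption can be sketched with the standard HMM forward algorithm. The Python below is an illustration under stated assumptions, not the authors' implementation: in a neural HMM the emission scores would come from a network, whereas here they are hard-coded placeholder numbers, and all names (`hmm_log_likelihood`, `independent_log_likelihood`, `prior`, `sticky`, `emit`) are hypothetical.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def hmm_log_likelihood(log_init, log_trans, log_emit):
    """Forward algorithm in log space.

    log_init[k]     = log p(z_1 = k)
    log_trans[j][k] = log p(z_t = k | z_{t-1} = j)
    log_emit[t][k]  = log p(x_t | z_t = k)
    """
    num_states = len(log_init)
    alpha = [log_init[k] + log_emit[0][k] for k in range(num_states)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[j] + log_trans[j][k] for j in range(num_states)])
            + log_emit[t][k]
            for k in range(num_states)
        ]
    return logsumexp(alpha)


def independent_log_likelihood(log_prior, log_emit):
    """Likelihood when every frame's latent variable is drawn independently."""
    return sum(
        logsumexp([log_prior[k] + frame[k] for k in range(len(log_prior))])
        for frame in log_emit
    )


# Placeholder scores for 3 frames and 2 latent codes (not from the paper).
prior = [math.log(0.6), math.log(0.4)]
emit = [
    [math.log(0.9), math.log(0.2)],
    [math.log(0.8), math.log(0.3)],
    [math.log(0.1), math.log(0.7)],
]
# "Sticky" transitions favor staying in the same code across adjacent frames.
sticky = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
]

ll_markov = hmm_log_likelihood(prior, sticky, emit)
ll_indep = independent_log_likelihood(prior, emit)
# If every transition row equals the prior, the HMM reduces to the independent model.
ll_reduced = hmm_log_likelihood(prior, [prior, prior], emit)
```

The last line shows why the two model families are directly comparable: an HMM whose transition rows ignore the previous state is exactly the independent-frames model, so the dependency is the only thing that differs.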

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Speech and dialogue systems