A Convolutional Deep Markov Model for Unsupervised Speech Representation Learning
Sameer Khurana, Antoine Laurent, Wei-Ning Hsu, Jan Chorowski, Adrian Lancucki, Ricard Marxer, James Glass

TL;DR
This paper introduces ConvDMM, a probabilistic deep neural model for unsupervised speech representation learning. Its features outperform multiple self-supervised methods on phoneme classification and improve phoneme recognition in low-resource settings.
Contribution
The paper presents ConvDMM, a novel Gaussian state-space model with deep neural network-based emission and transition functions for unsupervised speech feature extraction.
Findings
ConvDMM features outperform multiple self-supervised methods on phoneme classification.
ConvDMM complements existing self-supervised features, improving recognition results.
ConvDMM enables better phoneme recognition in low-resource settings.
Abstract
Probabilistic Latent Variable Models (LVMs) provide an alternative to self-supervised learning approaches for linguistic representation learning from speech. LVMs admit an intuitive probabilistic interpretation where the latent structure shapes the information extracted from the signal. Even though LVMs have recently seen renewed interest due to the introduction of Variational Autoencoders (VAEs), their use for speech representation learning remains largely unexplored. In this work, we propose the Convolutional Deep Markov Model (ConvDMM), a Gaussian state-space model with non-linear emission and transition functions modelled by deep neural networks. This unsupervised model is trained using black box variational inference. A deep convolutional neural network is used as an inference network for structured variational approximation. When trained on a large-scale speech dataset…
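To make the phrase "Gaussian state-space model with non-linear emission and transition functions" concrete, here is a minimal sketch of sampling from such a model. It is not the authors' implementation: the tiny randomly initialised networks, the dimensions, the fixed noise scales, and the names mlp and sample_sequence are all assumptions chosen for illustration, and the sketch leaves out the convolutional inference network and the variational training described in the abstract.

```python
import numpy as np


def mlp(rng, in_dim, hidden_dim, out_dim):
    """Return a randomly initialised one-hidden-layer tanh network.

    Hypothetical stand-in for the deep networks in the paper; untrained.
    """
    w1 = rng.normal(scale=0.5, size=(in_dim, hidden_dim))
    w2 = rng.normal(scale=0.5, size=(hidden_dim, out_dim))
    return lambda x: np.tanh(x @ w1) @ w2


def sample_sequence(num_steps=50, latent_dim=4, obs_dim=8, seed=0):
    """Ancestral sampling from a Gaussian state-space model.

    z_t ~ N(f(z_{t-1}), trans_std^2 I)   (non-linear transition)
    x_t ~ N(g(z_t),     emit_std^2 I)    (non-linear emission)
    All sizes and noise scales are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    transition_mean = mlp(rng, latent_dim, 16, latent_dim)  # f
    emission_mean = mlp(rng, latent_dim, 16, obs_dim)       # g
    trans_std, emit_std = 0.1, 0.05  # assumed fixed noise scales

    z = rng.normal(size=latent_dim)  # initial state drawn from N(0, I)
    latents, observations = [], []
    for _ in range(num_steps):
        # Emit an observation frame from the current latent state.
        x = emission_mean(z) + emit_std * rng.normal(size=obs_dim)
        latents.append(z)
        observations.append(x)
        # Move to the next latent state; it depends only on the current one.
        z = transition_mean(z) + trans_std * rng.normal(size=latent_dim)
    return np.stack(latents), np.stack(observations)


latents, observations = sample_sequence()
print(latents.shape, observations.shape)  # (50, 4) (50, 8)
```

In this sketch the latent sequence plays the role of the learned representation: each observed frame depends only on its own latent state, and each latent state depends only on the previous one, which is the Markov structure the model's name refers to.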
