Unsupervised Representation Learning of Speech for Dialect Identification

Suwon Shon; Wei-Ning Hsu; James Glass

arXiv:1809.04458·eess.AS·September 13, 2018

TL;DR

This paper introduces an unsupervised approach to dialect identification (DID) based on a factorized hierarchical variational autoencoder (FHVAE). The model disentangles phonetic and linguistic content from speaker and channel variation, which improves robustness, especially in low-resource scenarios.

Contribution

The paper presents a novel FHVAE-based method for unsupervised feature learning that improves dialect identification accuracy and robustness to domain mismatch, leveraging unlabeled data.

Findings

01

FHVAE features outperform conventional acoustic features and i-vectors in supervised DID tasks.

02

The approach effectively leverages unlabeled data to improve performance in low-resource settings.

03

Disentanglement reduces the impact of speaker and channel variability on dialect identification.

Abstract

In this paper, we explore the use of a factorized hierarchical variational autoencoder (FHVAE) model to learn an unsupervised latent representation for dialect identification (DID). An FHVAE can learn a latent space that separates the more static attributes within an utterance from the more dynamic attributes by encoding them into two different sets of latent variables. Useful factors for dialect identification, such as phonetic or linguistic content, are encoded by a segmental latent variable, while irrelevant factors that are relatively constant within a sequence, such as channel or speaker information, are encoded by a sequential latent variable. The disentanglement property makes the segmental latent variable less susceptible to channel and speaker variation, and thus reduces degradation from channel domain mismatch. We demonstrate that on fully-supervised DID tasks, an…
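
The split between factors that stay constant within an utterance and factors that change from segment to segment can be illustrated with a deliberately simple toy. The sketch below is not an FHVAE and is not the authors' method: it assumes scalar features and an additive offset per recording, and it uses the utterance mean as a stand-in for the sequence-level factor. The function name `split_factors`, the feature values, and the offsets are all hypothetical.

```python
from statistics import mean


def split_factors(utterance):
    """Split scalar segment features into two parts.

    Returns (seq_level, seg_level):
      seq_level: the utterance mean, a stand-in for factors that are
                 roughly constant within an utterance (speaker, channel).
      seg_level: per-segment residuals, a stand-in for factors that vary
                 from segment to segment (phonetic content).
    This is a linear toy, not a variational autoencoder.
    """
    seq_level = mean(utterance)
    seg_level = [x - seq_level for x in utterance]
    return seq_level, seg_level


# Hypothetical per-segment "content" values shared by two recordings.
content = [-1.0, 0.5, 2.0, -1.5]

# The same content under two hypothetical speaker/channel offsets.
utt_a = [c + 3.0 for c in content]
utt_b = [c - 2.0 for c in content]

seq_a, seg_a = split_factors(utt_a)
seq_b, seg_b = split_factors(utt_b)

# The sequence-level parts absorb the offsets, so the segment-level
# parts of the two recordings can be compared directly.
print("sequence-level:", seq_a, seq_b)
print("segment-level A:", seg_a)
print("segment-level B:", seg_b)
```

In this toy the segment-level residuals of the two recordings coincide even though the raw features differ by a constant, which is the intuition behind using a segment-level representation when the channel or speaker changes between training and test data. An FHVAE learns such a separation with neural encoders and decoders rather than by mean subtraction.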
