Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization

Wei Wei; Huang Hengguan; Gu Xiangming; Wang Hao; Wang Ye

arXiv:2205.02670·cs.LG·January 10, 2023

Open Access

TL;DR

This paper introduces an unsupervised hierarchical Bayesian deep learning model, ML-VAE, that effectively locates content mismatches between speech and text sequences, such as mispronunciations, without requiring labeled data.

Contribution

The paper presents a novel unsupervised deep learning framework, ML-VAE, with a specialized training procedure for mismatch localization in cross-modal sequential data, especially speech-text alignment.

Findings

01. ML-VAE accurately locates mismatches in speech-text data

02. The model operates without human annotations

03. The hierarchical Bayesian model effectively captures cross-modal relationships

Abstract

Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, which makes it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables, indicating the relationship between the two modalities. Training such a…
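
To make the task concrete, the sketch below illustrates what "locating a mismatch" between a target text and a spoken rendition means. It is not ML-VAE and not the authors' method: it swaps in a plain sequence alignment from Python's standard `difflib`, and it assumes the speech has already been converted to a phoneme sequence, which the paper's unsupervised setting does not assume. The function name `locate_mismatches` and the phoneme sequences are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher


def locate_mismatches(target, observed):
    """Return (tag, target_span, observed_span) for each non-matching region.

    Illustrative baseline only: aligns two symbol sequences and reports
    where they differ. Spans are half-open index ranges [start, end).
    """
    matcher = SequenceMatcher(a=target, b=observed, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # 'replace', 'delete', or 'insert'
            mismatches.append((tag, (i1, i2), (j1, j2)))
    return mismatches


# Hypothetical example: the learner says "D" where the target has "DH".
target = ["DH", "AH", "K", "AE", "T"]
observed = ["D", "AH", "K", "AE", "T"]

print(locate_mismatches(target, observed))
# prints [('replace', (0, 1), (0, 1))]
```

The output marks the first target phoneme as replaced, which is the kind of localization the abstract describes. The difference is that ML-VAE aims to infer such locations directly from speech and text without labeled data, rather than from a pre-transcribed phoneme sequence as assumed here.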

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis