Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization
Wei Wei, Huang Hengguan, Gu Xiangming, Wang Hao, Wang Ye

TL;DR
This paper introduces an unsupervised hierarchical Bayesian deep learning model, ML-VAE, that effectively locates content mismatches between speech and text sequences, such as mispronunciations, without requiring labeled data.
Contribution
The paper presents a novel unsupervised deep learning framework, ML-VAE, with a specialized training procedure for mismatch localization in cross-modal sequential data, especially speech-text alignment.
Findings
ML-VAE locates content mismatches, such as mispronunciations, in speech-text data.
The model is trained without human annotations.
Its hierarchical Bayesian structure models the relationship between the two modalities.
Abstract
Content mismatch usually occurs when data from one modality is translated to another, e.g., language learners producing mispronunciations (errors in speech) when reading a sentence (target text) aloud. However, most existing alignment algorithms assume that the content involved in the two modalities is perfectly matched, which makes it difficult to locate such mismatches between speech and text. In this work, we develop an unsupervised learning algorithm that can infer the relationship between content-mismatched cross-modal sequential data, especially for speech-text sequences. More specifically, we propose a hierarchical Bayesian deep learning model, dubbed mismatch localization variational autoencoder (ML-VAE), which decomposes the generative process of the speech into hierarchically structured latent variables that indicate the relationship between the two modalities. Training such a…
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
