Unsupervised Adaptation with Interpretable Disentangled Representations for Distant Conversational Speech Recognition
Wei-Ning Hsu, Hao Tang, James Glass

TL;DR
This paper introduces an unsupervised adaptation technique for speech recognition that synthesizes labeled data from unlabeled in-domain speech by disentangling linguistic and nuisance factors, significantly improving performance in distant conversational speech scenarios.
Contribution
It presents a novel method to learn interpretable speech representations and adapt models without labeled in-domain data, addressing domain mismatch in speech recognition.
Findings
Outperforms all baselines on the AMI dataset
Bridges over 77% of the gap between unadapted and in-domain models
Effectively handles channel mismatch in conversational speech
Abstract
The current trend in automatic speech recognition is to leverage large amounts of labeled data to train supervised neural network models. Unfortunately, obtaining data for a wide range of domains to train robust models can be costly. However, it is relatively inexpensive to collect large amounts of unlabeled data from domains that we want the models to generalize to. In this paper, we propose a novel unsupervised adaptation method that learns to synthesize labeled data for the target domain from unlabeled in-domain data and labeled out-of-domain data. We first learn without supervision an interpretable latent representation of speech that encodes linguistic and nuisance factors (e.g., speaker and channel) using different latent variables. To transform a labeled out-of-domain utterance without altering its transcript, we transform the latent nuisance variables while maintaining the…
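The transformation step described in the abstract (change the nuisance latents, keep the linguistic ones) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `shift_nuisance`, the latent dimensions, the mean-shift rule, and the random toy latents are all assumptions made for illustration. A real system would obtain the latents from a trained encoder and decode the result back into speech features.

```python
import numpy as np

def shift_nuisance(z_ling, z_nuis, src_nuis_mean, tgt_nuis_mean):
    """Hypothetical helper: move nuisance latents from the source domain
    toward the target domain by a mean shift (an assumed rule), while
    returning the linguistic latents unchanged so the transcript still applies."""
    z_nuis_new = z_nuis + (tgt_nuis_mean - src_nuis_mean)
    return z_ling, z_nuis_new

# Toy latents standing in for encoder outputs (assumed shapes).
rng = np.random.default_rng(0)
z_ling = rng.normal(size=(5, 4))            # 5 segments, 4-dim linguistic latent
z_nuis = rng.normal(loc=1.0, size=(5, 3))   # 5 segments, 3-dim nuisance latent

src_mean = z_nuis.mean(axis=0)              # nuisance mean of the labeled out-of-domain utterance
tgt_mean = np.array([-1.0, 0.0, 2.0])       # assumed nuisance mean of the unlabeled target domain

new_ling, new_nuis = shift_nuisance(z_ling, z_nuis, src_mean, tgt_mean)
```

In this sketch the transformed utterance keeps its original label because only the nuisance part of the latent code moves; the linguistic part is passed through untouched.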
