Latent Dirichlet Allocation Based Acoustic Data Selection for Automatic Speech Recognition
Mortaza (Morrie) Doulaty, Thomas Hain

TL;DR
This paper uses Acoustic Latent Dirichlet Allocation (aLDA) as a similarity criterion for data selection: from a large pool of diverse, mostly out-of-domain speech, it picks the training data that best matches a small set of target-domain utterances, with the aim of improving automatic speech recognition.
Contribution
It proposes aLDA as a data similarity criterion for selecting in-domain training data, and reports better speech recognition performance than random selection and posterior-based selection.
Findings
aLDA-based selection outperforms random and posterior-based methods
Selected data improves speech recognition accuracy
Method effectively handles large, diverse datasets
Abstract
Selecting in-domain data from a large pool of diverse and out-of-domain data is a non-trivial problem. In most cases simply using all of the available data will lead to sub-optimal and in some cases even worse performance compared to carefully selecting a matching set. This is true even for data-inefficient neural models. Acoustic Latent Dirichlet Allocation (aLDA) is shown to be useful in a variety of speech technology related tasks, including domain adaptation of acoustic models for automatic speech recognition and entity labeling for information retrieval. In this paper we propose to use aLDA as a data similarity criterion in a data selection framework. Given a large pool of out-of-domain and potentially mismatched data, the task is to select the best-matching training data to a set of representative utterances sampled from a target domain. Our target data consists of around 32 hours…
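The selection task described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes each utterance has already been mapped to a topic posterior vector by some aLDA model, and it uses cosine similarity to the mean target vector plus a greedy duration budget, both of which are illustrative choices rather than details taken from the paper. All names (`select_matching`, `budget_seconds`, the utterance ids) are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def mean_vector(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def select_matching(
    pool: Dict[str, Sequence[float]],
    durations: Dict[str, float],
    target: Sequence[Sequence[float]],
    budget_seconds: float,
) -> List[str]:
    """Greedily pick the pool utterances most similar to the target set.

    pool maps utterance id -> topic posterior vector (assumed precomputed),
    durations maps utterance id -> length in seconds.
    """
    # Assumption: the target domain is summarised by its mean topic posterior.
    centroid = mean_vector(target)
    ranked = sorted(pool, key=lambda u: cosine(pool[u], centroid), reverse=True)
    selected: List[str] = []
    used = 0.0
    for utt in ranked:
        if used + durations[utt] > budget_seconds:
            continue  # skip utterances that would exceed the budget
        selected.append(utt)
        used += durations[utt]
    return selected


# Toy example with three hypothetical topics.
target = [[0.8, 0.1, 0.1], [0.7, 0.2, 0.1]]
pool = {
    "meeting_01": [0.75, 0.15, 0.10],
    "audiobook_07": [0.05, 0.90, 0.05],
    "telephony_03": [0.10, 0.10, 0.80],
    "meeting_02": [0.60, 0.30, 0.10],
}
durations = {
    "meeting_01": 6.0,
    "audiobook_07": 5.0,
    "telephony_03": 4.0,
    "meeting_02": 7.0,
}
chosen = select_matching(pool, durations, target, budget_seconds=13.0)
print(chosen)
```

In this toy pool the two meeting-like utterances rank highest and fill the 13-second budget, so the audiobook and telephony utterances are left out. A real system would replace the hand-written vectors with posteriors inferred by a trained topic model and would likely use a different similarity measure or selection rule.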
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Natural Language Processing Techniques
