Likelihood-based semi-supervised model selection with applications to speech processing
Christopher M. White, Sanjeev P. Khudanpur, and Patrick J. Wolfe

TL;DR
This paper introduces a semi-supervised likelihood-based model selection method for speech processing that uses unlabeled data and robust statistics to improve model choice without costly labeled datasets.
Contribution
It develops a novel semi-supervised framework leveraging automatic labeling and robust statistical tests for model selection in speech systems, reducing reliance on labeled data.
Findings
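Unlabeled speech data, labeled automatically by the trained classifier for each candidate model, can stand in for labeled development data in model selection.
The resulting labeling errors can be handled with robust statistics, yielding minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case.
The framework may carry over to other large-scale pattern recognition tasks where labeled development data are costly to obtain.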
Abstract
In conventional supervised pattern recognition tasks, model selection is typically accomplished by minimizing the classification error rate on a set of so-called development data, subject to ground-truth labeling by human experts or some other means. In the context of speech processing systems and other large-scale practical applications, however, such labeled development data are typically costly and difficult to obtain. This article proposes an alternative semi-supervised framework for likelihood-based model selection that leverages unlabeled data by using trained classifiers representing each model to automatically generate putative labels. The errors that result from this automatic labeling are shown to be amenable to results from robust statistics, which in turn provide for minimax-optimal censored likelihood ratio tests that recover the nonparametric sign test as a limiting case.…
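As a rough illustration of the censored likelihood ratio idea mentioned in the abstract, the sketch below clips per-sample log-likelihood ratios between two candidate models before summing them, and shows that a very small clipping level reduces the statistic to a sign count. This is not the paper's procedure: the function names, the symmetric clipping level `clip`, and the toy log-likelihood values are all assumptions made for illustration, and in a Huber-style robust test the censoring bounds would be derived from an assumed contamination level rather than chosen by hand.

```python
def log_likelihood_ratios(loglik_a, loglik_b):
    """Per-sample log-likelihood ratio of model A over model B."""
    return [a - b for a, b in zip(loglik_a, loglik_b)]


def censored_llr_statistic(llrs, clip):
    """Sum of log-likelihood ratios, each clipped to [-clip, clip].

    `clip` is a hypothetical symmetric censoring level (an assumption of
    this sketch); clipping limits the influence of any single sample,
    for example one whose automatic label is wrong.
    """
    if clip <= 0:
        raise ValueError("clip must be positive")
    return sum(max(-clip, min(clip, r)) for r in llrs)


def sign_statistic(llrs):
    """Sign-test statistic: (# samples favoring A) - (# samples favoring B)."""
    return sum((r > 0) - (r < 0) for r in llrs)


# Toy per-sample log-likelihoods under two candidate models (made-up values).
loglik_a = [-1.0, -2.0, -0.5, -3.0]
loglik_b = [-1.5, -1.0, -2.5, -3.2]
llrs = log_likelihood_ratios(loglik_a, loglik_b)

# A loose clip leaves the ordinary summed log-likelihood ratio unchanged.
uncensored = censored_llr_statistic(llrs, clip=10.0)

# A tight clip (below every |ratio|) makes each term +/- clip, so the
# rescaled statistic matches the sign statistic.
tight = 0.1
rescaled = censored_llr_statistic(llrs, clip=tight) / tight
signs = sign_statistic(llrs)

# Select model A if the censored statistic is positive, otherwise model B.
selected = "A" if censored_llr_statistic(llrs, clip=1.0) > 0 else "B"
```

The design point is that the clipping level trades efficiency for robustness: a large level behaves like the ordinary likelihood ratio test, while a level approaching zero behaves like the sign test, which depends only on which model each sample favors.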
