O-1: Self-training with Oracle and 1-best Hypothesis
Murali Karthick Baskar, Andrew Rosenberg, Bhuvana Ramabhadran, Kartik Audhkhasi

TL;DR
O-1 is a novel self-training objective for speech recognition that reduces bias, unifies training and evaluation metrics, and significantly improves recognition accuracy across multiple datasets.
Contribution
The paper introduces O-1, a faster variant of Expected Minimum Bayes Risk (EMBR) training that boosts the oracle hypothesis and works with both supervised and unsupervised data, improving recognition performance.
Findings
On SpeechStew, O-1 closes the gap between actual and oracle WER by 80% relative, whereas EMBR closes it by 43% relative.
O-1 achieves 13-25% relative improvement over EMBR on SpeechStew datasets.
On the in-house dataset, O-1 reduces the gap to the oracle WER by 12% relative compared with EMBR training.
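The findings above are stated in terms of the gap between the actual (1-best) WER and the oracle WER. As a minimal sketch of those quantities, assuming the usual definitions (the 1-best is the hypothesis the model scores highest in an N-best list, and the oracle is the hypothesis in that list with the fewest word errors against the reference), the Python below computes both on a toy example. The sentences, scores, and function names are hypothetical and chosen only for illustration. The `relative_gap_closed` helper is one plausible reading of "closes the gap by X% relative"; this summary does not give the paper's exact definition.

```python
def word_errors(ref, hyp):
    """Word-level Levenshtein distance between a reference and a hypothesis."""
    r, h = ref.split(), hyp.split()
    prev = list(range(len(h) + 1))
    for i, rw in enumerate(r, start=1):
        curr = [i]
        for j, hw in enumerate(h, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (rw != hw),  # substitution (or match, cost 0)
            ))
        prev = curr
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word errors divided by reference length."""
    return word_errors(ref, hyp) / len(ref.split())


def one_best_and_oracle_wer(ref, nbest):
    """nbest is a list of (hypothesis, model_score) pairs.

    Returns (WER of the top-scoring hypothesis, WER of the lowest-error hypothesis).
    """
    one_best = max(nbest, key=lambda pair: pair[1])[0]
    oracle = min((hyp for hyp, _ in nbest), key=lambda hyp: word_errors(ref, hyp))
    return wer(ref, one_best), wer(ref, oracle)


def relative_gap_closed(baseline_wer, new_wer, oracle_wer):
    """Assumed definition: fraction of the baseline-to-oracle gap that was removed."""
    return (baseline_wer - new_wer) / (baseline_wer - oracle_wer)


# Hypothetical toy data, not taken from the paper.
REF = "the cat sat on the mat"
NBEST = [
    ("the cat sat on a mat", -1.0),    # highest model score, one substitution
    ("the cat sat on the mat", -1.5),  # lower score, but matches the reference
    ("a cat sat on mat", -3.0),
]

one_best_wer, oracle_wer = one_best_and_oracle_wer(REF, NBEST)
gap = one_best_wer - oracle_wer
```

Here the model's top-scoring hypothesis is not the most accurate one in the list, so `gap` is positive. A training objective that boosts the oracle hypothesis aims to shrink this difference.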
Abstract
We introduce O-1, a new self-training objective to reduce training bias and unify training and evaluation metrics for speech recognition. O-1 is a faster variant of Expected Minimum Bayes Risk (EMBR) that boosts the oracle hypothesis and can accommodate both supervised and unsupervised data. We demonstrate the effectiveness of our approach in terms of recognition on publicly available SpeechStew datasets and a large-scale, in-house dataset. On SpeechStew, the O-1 objective closes the gap between the actual and oracle performance by 80% relative, compared to EMBR, which bridges the gap by 43% relative. O-1 achieves 13% to 25% relative improvement over EMBR on the various datasets that SpeechStew comprises, and a 12% relative gap reduction with respect to the oracle WER over EMBR training on the in-house dataset. Overall, O-1 results in a 9% relative improvement in WER over EMBR,…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Speech and Audio Processing
