Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition
Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel Povey, Sanjeev Khudanpur

TL;DR
This paper introduces Omni-temporal Classification (OTC), a new training method for speech recognition that effectively handles imperfect transcripts with high error rates, improving robustness over traditional methods.
Contribution
The paper proposes OTC, an extension of CTC that incorporates label uncertainties using weighted finite state transducers, enabling robust training with flawed data.
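For orientation, here is a minimal sketch of the conventional CTC objective that OTC extends. It covers only the baseline: it sums the probability of every frame-level alignment that collapses to the transcript, using the standard forward recursion. It does not implement OTC or its weighted finite state transducer construction, which this summary does not describe in detail. The function name `ctc_likelihood`, the plain-list probability format, and the choice of index 0 for the blank symbol are assumptions made for illustration.

```python
from typing import List, Sequence


def ctc_likelihood(
    probs: Sequence[Sequence[float]],
    labels: List[int],
    blank: int = 0,
) -> float:
    """Total probability of all alignments that collapse to `labels`.

    probs[t][k] is the model's probability of symbol k at frame t.
    (Hypothetical helper for illustration; not code from the paper.)
    """
    # Interleave blanks around the labels: [a, b] -> [_, a, _, b, _]
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    num_states, num_frames = len(ext), len(probs)

    # alpha[s]: probability of being in extended state s after frame t
    alpha = [0.0] * num_states
    alpha[0] = probs[0][ext[0]]
    if num_states > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, num_frames):
        new_alpha = [0.0] * num_states
        for s in range(num_states):
            total = alpha[s]                      # stay in the same state
            if s >= 1:
                total += alpha[s - 1]             # advance by one state
            # Skip over a blank only between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new_alpha[s] = total * probs[t][ext[s]]
        alpha = new_alpha

    # Valid alignments end on the last label or on the trailing blank
    result = alpha[-1]
    if num_states > 1:
        result += alpha[-2]
    return result


# Two frames, two symbols (0 = blank, 1 = "a"), uniform probabilities.
# Alignments "aa", "a_", "_a" all collapse to "a": 3 * 0.25 = 0.75
uniform = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_likelihood(uniform, [1]))
```

Because this objective treats the transcript as exactly correct, every alignment is forced to pass through each transcript token in order. That rigidity is the property OTC relaxes by incorporating label uncertainties.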
Findings
OTC maintains performance with transcripts containing up to 70% errors.
OTC outperforms standard CTC on datasets with imperfect transcripts.
Training with OTC prevents performance degradation caused by transcript errors.
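This summary does not state how the transcript error percentage is measured. One common way to quantify how far a flawed transcript departs from a verbatim one is word-level edit distance divided by the reference length. The sketch below assumes that definition; the helper names `edit_distance` and `transcript_error_rate` are hypothetical and the paper's own measurement may differ.

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `ref` into `hyp` (Levenshtein distance)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(
                min(
                    prev[j] + 1,                       # deletion
                    cur[j - 1] + 1,                    # insertion
                    prev[j - 1] + (ref_word != hyp_word),  # substitution or match
                )
            )
        prev = cur
    return prev[-1]


def transcript_error_rate(reference: str, flawed: str) -> float:
    """Word-level error rate of `flawed` relative to `reference`.

    Assumed definition for illustration, not taken from the paper.
    """
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(ref_words, flawed.split()) / len(ref_words)


# One deletion ("sat") and one substitution ("the" -> "a") over 6 words
rate = transcript_error_rate("the cat sat on the mat", "the cat on a mat")
print(f"{rate:.3f}")
```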
Abstract
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data. However, human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models. In this paper, we propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties originating from such weak supervision. This allows the model to effectively learn speech-text alignments while accommodating errors present in the training transcripts. OTC extends the conventional CTC objective for imperfect transcripts by leveraging weighted finite state transducers. Through experiments conducted on the LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with OTC avoids performance degradation even with transcripts containing up to 70% errors, a scenario where CTC models fail completely. Our…
Taxonomy
Topics: Speech Recognition and Synthesis · Natural Language Processing Techniques · Topic Modeling
