Multiple-hypothesis CTC-based semi-supervised adaptation of end-to-end speech recognition
Cong-Thanh Do, Rama Doddipatla, Thomas Hain

TL;DR
This paper introduces a semi-supervised adaptation method for end-to-end speech recognition that integrates multiple ASR 1-best hypotheses into the CTC loss, reducing the impact of hypothesis errors and lowering word error rates in clean and multi-condition training scenarios.
Contribution
It proposes a novel multi-hypothesis CTC-based adaptation technique that effectively utilizes unlabeled data by integrating multiple ASR hypotheses during training.
Findings
Achieved a 6.6% relative word error rate (WER) reduction in the clean training scenario (WSJ).
Achieved a 5.8% relative WER reduction in the multi-condition training scenario (CHiME-4).
Showed that the gains hold in both clean and multi-condition training.
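For readers unfamiliar with the metric, a relative WER reduction is the drop in WER divided by the baseline WER. The sketch below shows only the arithmetic; the function name and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer, adapted_wer):
    """Return the relative WER reduction, in percent, with respect to the baseline."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical WERs chosen only to illustrate the arithmetic; not results from the paper.
example = relative_wer_reduction(baseline_wer=10.0, adapted_wer=9.0)
print(f"{example:.1f}% relative WER reduction")
```

A relative reduction is therefore always smaller in absolute terms than it sounds: with a low baseline WER, a reduction of several percent relative corresponds to a fraction of a point absolute.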
Abstract
This paper proposes an adaptation method for end-to-end speech recognition. In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated into the computation of the connectionist temporal classification (CTC) loss function. Integrating multiple ASR hypotheses helps alleviate the impact of errors in those hypotheses on the computation of the CTC loss. When applied in semi-supervised adaptation scenarios where part of the adaptation data is unlabeled, the CTC loss of the proposed method is computed from different ASR 1-best hypotheses obtained by decoding the unlabeled adaptation data. Experiments are performed in clean and multi-condition training scenarios where the CTC-based end-to-end ASR systems are trained on Wall Street Journal (WSJ) clean training data and CHiME-4 multi-condition training data,…
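The idea of computing a CTC loss from several hypotheses can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say how the per-hypothesis losses are combined, so the unweighted average used here is an assumption, and the function names (`ctc_neg_log_likelihood`, `multi_hypothesis_ctc_loss`) are hypothetical. The per-hypothesis term is the standard CTC forward algorithm in log space.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf entries."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """-log p(labels | input) via the standard CTC forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels: non-empty list of symbol ids (without blanks).
    """
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank].
    ext = [blank]
    for s in labels:
        ext += [s, blank]
    S = len(ext)

    # A path may start in the leading blank or in the first label.
    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][ext[0]]
    alpha[1] = log_probs[0][ext[1]]

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]                      # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])          # advance by one position
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])          # skip the blank between distinct labels
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    # A valid path ends in the last label or the trailing blank.
    return -logsumexp(alpha[-1], alpha[-2])


def multi_hypothesis_ctc_loss(log_probs, hypotheses, blank=0):
    """Average the CTC loss over several 1-best hypotheses for one utterance.

    Assumption: unweighted averaging; the combination rule is not given in the abstract.
    """
    losses = [ctc_neg_log_likelihood(log_probs, h, blank) for h in hypotheses]
    return sum(losses) / len(losses)


# Toy input: 2 frames, 3 symbols (blank=0, 1, 2), uniform posteriors.
uniform = [[math.log(1 / 3)] * 3 for _ in range(2)]
print(multi_hypothesis_ctc_loss(uniform, [[1], [1, 2]]))
```

In an adaptation setting, the hypotheses for an unlabeled utterance would come from decoding it with different ASR systems or configurations; because each hypothesis contributes only part of the loss, an error present in a single hypothesis has less influence on the gradient than it would if that hypothesis were used alone.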
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: Connectionist Temporal Classification Loss
