Multiple-hypothesis CTC-based semi-supervised adaptation of end-to-end speech recognition

Cong-Thanh Do; Rama Doddipatla; Thomas Hain

arXiv:2103.15515·cs.CL·April 1, 2021

TL;DR

This paper introduces a semi-supervised adaptation method for end-to-end speech recognition that leverages multiple ASR hypotheses in the CTC loss to improve robustness and reduce word error rates in various training scenarios.

Contribution

It proposes a novel multi-hypothesis CTC-based adaptation technique that effectively utilizes unlabeled data by integrating multiple ASR hypotheses during training.

Findings

01 Achieved 6.6% relative WER reduction in clean data scenarios.

02 Achieved 5.8% relative WER reduction in multi-condition scenarios.

03 Demonstrated robustness of the method across different training conditions.
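For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate expressed as a percentage of the baseline WER. The sketch below shows the arithmetic only; the function name and the WER values are hypothetical illustrations and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, adapted_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in percent (e.g. 20.0 for 20%).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - adapted_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the formula:
# a baseline at 20.0% WER and an adapted system at 18.68% WER.
reduction = relative_wer_reduction(20.0, 18.68)
print(f"{reduction:.1f}% relative WER reduction")
```

Note that a relative reduction says nothing about the absolute WER on its own: the same percentage can correspond to very different absolute gains depending on the baseline.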

Abstract

This paper proposes an adaptation method for end-to-end speech recognition. In this method, multiple automatic speech recognition (ASR) 1-best hypotheses are integrated into the computation of the connectionist temporal classification (CTC) loss function. Integrating multiple ASR hypotheses helps alleviate the impact of errors in those hypotheses on the computation of the CTC loss when ASR hypotheses are used. When applied in semi-supervised adaptation scenarios where part of the adaptation data do not have labels, the CTC loss of the proposed method is computed from different ASR 1-best hypotheses obtained by decoding the unlabeled adaptation data. Experiments are performed in clean and multi-condition training scenarios where the CTC-based end-to-end ASR systems are trained on Wall Street Journal (WSJ) clean training data and CHiME-4 multi-condition training data,…
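The idea of computing a CTC loss from several 1-best hypotheses can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the abstract does not state how the per-hypothesis losses are combined, so summing them is an assumption made here, and the function names (`ctc_neg_log_likelihood`, `multi_hypothesis_ctc_loss`) are hypothetical. The CTC term itself is the standard forward algorithm in the log domain.

```python
import math

NEG_INF = float("-inf")


def logsumexp(*xs):
    """Numerically stable log(sum(exp(x))) that tolerates -inf inputs."""
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_neg_log_likelihood(log_probs, labels, blank=0):
    """Standard CTC loss, -log P(labels | frames), via the forward algorithm.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    labels:    list of symbol indices without blanks.
    """
    # Interleave blanks: [b, l1, b, l2, ..., b]
    ext = [blank]
    for label in labels:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)

    alpha = [NEG_INF] * S
    alpha[0] = log_probs[0][blank]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new = [NEG_INF] * S
        for s in range(S):
            terms = [alpha[s]]
            if s >= 1:
                terms.append(alpha[s - 1])
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new[s] = logsumexp(*terms) + log_probs[t][ext[s]]
        alpha = new

    total = logsumexp(alpha[-1], alpha[-2]) if S > 1 else alpha[-1]
    return -total


def multi_hypothesis_ctc_loss(log_probs, hypotheses, blank=0):
    """ASSUMPTION: combine the hypotheses by summing their CTC losses."""
    return sum(ctc_neg_log_likelihood(log_probs, h, blank) for h in hypotheses)


# Toy example: 2 frames, 3 symbols (index 0 is blank), uniform posteriors.
frames = [[math.log(1 / 3)] * 3 for _ in range(2)]
# Two hypothetical 1-best hypotheses for the same unlabeled utterance,
# e.g. produced by two different ASR systems.
hypotheses = [[1], [1, 2]]
print(multi_hypothesis_ctc_loss(frames, hypotheses))
```

In a semi-supervised setting of the kind the abstract describes, the `hypotheses` list would hold the 1-best outputs obtained by decoding an unlabeled utterance, so that an error in any single hypothesis contributes to only part of the training signal.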

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Connectionist Temporal Classification Loss