Semi-Supervised Speech Recognition via Graph-based Temporal Classification
Niko Moritz, Takaaki Hori, Jonathan Le Roux

TL;DR
This paper introduces graph-based temporal classification (GTC), a generalization of CTC that trains on a graph built from N-best pseudo-labels. In semi-supervised speech recognition, this lets self-training exploit the seed model's label uncertainty instead of relying on the 1-best hypothesis alone, and it outperforms standard pseudo-labeling.
Contribution
It proposes the graph-based temporal classification (GTC) objective, a generalized form of CTC that accepts a graph representation of the training labels, so that an N-best list of pseudo-labels can serve as supervision in semi-supervised ASR training.
Findings
GTC outperforms standard pseudo-labeling methods.
Approaches the results of an oracle setup in which the best hypothesis in each N-best list is selected manually.
Effectively exploits label uncertainties from N-best hypotheses.
Abstract
Semi-supervised learning has demonstrated promising results in automatic speech recognition (ASR) by self-training using a seed ASR model with pseudo-labels generated for unlabeled data. The effectiveness of this approach largely relies on the pseudo-label accuracy, for which typically only the 1-best ASR hypothesis is used. However, alternative ASR hypotheses of an N-best list can provide more accurate labels for an unlabeled speech utterance and also reflect uncertainties of the seed ASR model. In this paper, we propose a generalized form of the connectionist temporal classification (CTC) objective that accepts a graph representation of the training labels. The newly proposed graph-based temporal classification (GTC) objective is applied for self-training with WFST-based supervision, which is generated from an N-best list of pseudo-labels. In this setup, GTC is used to learn not only…
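The idea of a CTC-style objective that accepts a graph of labels can be illustrated with a small forward recursion. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`graph_forward_likelihood`, `ctc_graph`, `union_of_hypotheses`), the dictionary-based graph encoding, and the unit edge weights are all hypothetical choices made for illustration. It sums the probability of every frame-level path through a label graph, shows that a linear chain with blanks reproduces the ordinary CTC likelihood, and builds a toy N-best graph as a plain parallel union of hypotheses. The WFST construction and hypothesis weighting used in the paper are not described in this summary and are not reproduced here.

```python
import math


def graph_forward_likelihood(posteriors, node_labels, edges, start_nodes, final_nodes):
    """Sum the probability of every length-T path through a label graph.

    posteriors  : list of T dicts, posteriors[t][label] = p(label | frame t)
    node_labels : dict node -> label emitted while the path sits on that node
    edges       : dict (src, dst) -> transition weight (1.0 means unweighted)
    start_nodes : nodes a path may occupy at the first frame
    final_nodes : nodes a path may occupy at the last frame
    """
    # alpha[n] = total probability of all partial paths ending in node n
    alpha = {n: posteriors[0][node_labels[n]] for n in start_nodes}
    for frame in posteriors[1:]:
        nxt = {}
        for (src, dst), weight in edges.items():
            if src in alpha:
                step = alpha[src] * weight * frame[node_labels[dst]]
                nxt[dst] = nxt.get(dst, 0.0) + step
        alpha = nxt
    return sum(alpha.get(n, 0.0) for n in final_nodes)


def ctc_graph(labels, blank="-"):
    """Build the standard CTC topology for one label sequence as a graph."""
    ext = [blank]
    for label in labels:
        ext += [label, blank]  # blank, l1, blank, l2, ..., blank
    node_labels = dict(enumerate(ext))
    edges = {}
    for i in range(len(ext)):
        edges[(i, i)] = 1.0  # stay on the same node (repeat)
        if i + 1 < len(ext):
            edges[(i, i + 1)] = 1.0  # advance by one node
        # skip the blank between two different labels
        if i + 2 < len(ext) and ext[i] != blank and ext[i] != ext[i + 2]:
            edges[(i, i + 2)] = 1.0
    if len(ext) == 1:  # empty label sequence: a single blank node
        return node_labels, edges, [0], [0]
    return node_labels, edges, [0, 1], [len(ext) - 1, len(ext) - 2]


def union_of_hypotheses(hypotheses, blank="-"):
    """Toy N-best graph: place the CTC graphs of all hypotheses in parallel."""
    node_labels, edges, start, final = {}, {}, [], []
    for hyp in hypotheses:
        offset = len(node_labels)  # shift node ids so the graphs stay disjoint
        labels_h, edges_h, start_h, final_h = ctc_graph(hyp, blank)
        node_labels.update({n + offset: lab for n, lab in labels_h.items()})
        edges.update({(s + offset, d + offset): w for (s, d), w in edges_h.items()})
        start += [n + offset for n in start_h]
        final += [n + offset for n in final_h]
    return node_labels, edges, start, final


if __name__ == "__main__":
    # Two frames of made-up posteriors over the alphabet {blank, a, b}.
    post = [{"-": 0.2, "a": 0.5, "b": 0.3}, {"-": 0.1, "a": 0.6, "b": 0.3}]
    single = graph_forward_likelihood(post, *ctc_graph(["a"]))
    two_best = graph_forward_likelihood(post, *union_of_hypotheses([["a"], ["b"]]))
    print("loss with 1-best graph:", -math.log(single))
    print("loss with 2-best graph:", -math.log(two_best))
```

With a single hypothesis, the graph is the usual CTC lattice and the result is the usual CTC likelihood. With the parallel union, the training signal is the total probability of all listed hypotheses, so the model is not forced to commit to a possibly wrong 1-best label. The sketch works in the probability domain for readability; a practical version would use log-probabilities to avoid underflow on long utterances.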
