Alternative Pseudo-Labeling for Semi-Supervised Automatic Speech Recognition
Han Zhu, Dongji Gao, Gaofeng Cheng, Daniel Povey, Pengyuan Zhang, Yonghong Yan

TL;DR
This paper introduces a novel alternative pseudo-labeling framework for semi-supervised automatic speech recognition that effectively handles noisy pseudo-labels through a generalized CTC loss, confidence-based error detection, and automatic thresholding.
Contribution
The paper proposes a training objective that accepts alternative tokens, uses a contrastive CTC loss to improve error detection, and tunes the detection threshold automatically.
Findings
Enhanced recognition accuracy with noisy pseudo-labels.
Effective error detection via contrastive CTC loss.
Automated thresholding reduces manual tuning effort.
Abstract
When labeled data is insufficient, semi-supervised learning with the pseudo-labeling technique can significantly improve the performance of automatic speech recognition. However, pseudo-labels are often noisy, containing numerous incorrect tokens. Taking noisy labels as ground-truth in the loss function results in suboptimal performance. Previous works attempted to mitigate this issue by either filtering out the noisiest pseudo-labels or improving the overall quality of pseudo-labels. While these methods are effective to some extent, it is unrealistic to entirely eliminate incorrect tokens in pseudo-labels. In this work, we propose a novel framework named alternative pseudo-labeling to tackle the issue of noisy pseudo-labels from the perspective of the training objective. The framework comprises several components. Firstly, a generalized CTC loss function is introduced to handle noisy…
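Example
The confidence-based error detection mentioned above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the function name flag_low_confidence, the per-token confidence scores, the toy data, and the hand-picked threshold of 0.5 are all hypothetical. The paper reports choosing the threshold automatically, by a procedure this excerpt does not describe.

```python
from typing import List, Tuple


def flag_low_confidence(
    tokens: List[str], confidences: List[float], threshold: float
) -> List[Tuple[str, bool]]:
    """Pair each pseudo-label token with True if it is suspected incorrect.

    Hypothetical helper: a token is flagged when its confidence score
    falls strictly below the threshold.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")
    return [(tok, conf < threshold) for tok, conf in zip(tokens, confidences)]


# Toy pseudo-label with made-up confidence scores (assumed, not from the paper).
pseudo_label = ["the", "cat", "sad", "on", "mat"]
confidences = [0.98, 0.91, 0.42, 0.95, 0.88]

# The threshold is fixed by hand here purely for illustration.
flags = flag_low_confidence(pseudo_label, confidences, threshold=0.5)
suspected = [tok for tok, is_suspect in flags if is_suspect]
print(suspected)
```

In a framework of the kind summarized here, flagged positions would presumably be the ones where the training objective tolerates alternative tokens rather than treating the pseudo-label as ground truth.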
Taxonomy
Topics: Speech and Audio Processing · Speech Recognition and Synthesis · Music and Audio Processing
Methods: Connectionist Temporal Classification Loss
