Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training
Chengyi Wang, Yiming Wang, Yu Wu, Sanyuan Chen, Jinyu Li, Shujie Liu,, Furu Wei

TL;DR
This paper introduces supervision-guided codebook methods for masked prediction pre-training in speech recognition, improving accuracy and efficiency over unsupervised approaches by leveraging phoneme alignments and supervised features.
Contribution
It proposes two novel supervision-guided codebook generation methods, PBERT and CTC clustering, enhancing speech pre-training and recognition performance.
Findings
Up to 17.0% relative WER reduction compared to baselines
Improved pre-training efficiency and accuracy
Good transferability to non-ASR speech tasks
Abstract
Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance and also the pre-training efficiency, either through decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or performing clustering on the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate significant superiority of our methods to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsSpeech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
