Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training

Chengyi Wang; Yiming Wang; Yu Wu; Sanyuan Chen; Jinyu Li; Shujie Liu; Furu Wei

arXiv:2206.10125 · cs.CL · June 22, 2022

Open Access

TL;DR

This paper introduces supervision-guided codebook methods for masked prediction pre-training in speech recognition, improving accuracy and efficiency over unsupervised approaches by leveraging phoneme alignments and supervised features.

Contribution

It proposes two novel supervision-guided codebook generation methods, PBERT and CTC clustering, enhancing speech pre-training and recognition performance.

Findings

01 Up to 17.0% relative WER reduction compared to baselines

02 Improved pre-training efficiency and accuracy

03 Good transferability to non-ASR speech tasks
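
For readers unfamiliar with the metric, a relative WER reduction is the drop in word error rate divided by the baseline's word error rate. The sketch below shows only that arithmetic. The function name and the example numbers are hypothetical and are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction as a fraction of the baseline WER."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers for illustration only (not taken from the paper):
# a baseline at 10.0% WER and a new system at 8.3% WER.
reduction = relative_wer_reduction(10.0, 8.3)
print(f"relative WER reduction: {reduction:.1%}")
```

Note that a relative reduction is not the same as an absolute one: in the hypothetical case above the absolute drop is 1.7 WER points.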

Abstract

Recently, masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition. It usually requires a codebook obtained in an unsupervised way, making it less accurate and difficult to interpret. We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance and also the pre-training efficiency, either through decoding with a hybrid ASR system to generate phoneme-level alignments (named PBERT), or performing clustering on the supervised speech features extracted from an end-to-end CTC model (named CTC clustering). Both the hybrid and CTC models are trained on the same small amount of labeled speech as used in fine-tuning. Experiments demonstrate significant superiority of our methods to various SSL and self-training baselines, with up to 17.0% relative WER reduction. Our…
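
The CTC clustering idea described in the abstract can be sketched roughly as follows: cluster frame-level features into a fixed number of codes, then use each frame's code index as its masked-prediction target. This is a minimal sketch with plain k-means in NumPy, not the authors' implementation. The random array stands in for features from a CTC model, and the function names, feature dimension, number of codes, and iteration count are all assumptions chosen for illustration.

```python
import numpy as np


def kmeans_codebook(features, num_codes, num_iters=20, seed=0):
    """Cluster frame features of shape (num_frames, dim) with plain k-means.

    Returns (centroids, labels): the codebook and one code index per frame.
    """
    rng = np.random.default_rng(seed)
    # Initialise centroids from randomly chosen frames.
    init = rng.choice(len(features), size=num_codes, replace=False)
    centroids = features[init].copy()
    for _ in range(num_iters):
        labels = assign_codes(features, centroids)
        for k in range(num_codes):
            members = features[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids, assign_codes(features, centroids)


def assign_codes(features, centroids):
    """Map each frame to the index of its nearest centroid."""
    diffs = features[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=-1).argmin(axis=1)


# Placeholder for supervised features (assumption: 500 frames, 16 dims).
rng = np.random.default_rng(1)
frame_features = rng.normal(size=(500, 16))

codebook, targets = kmeans_codebook(frame_features, num_codes=8)
print(codebook.shape, targets.shape)
```

In a pre-training setup, `targets` would play the role of the discrete labels a model predicts at masked frames. Which model layer supplies the features and how many codes are used are details this page does not state.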

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing