Biased Self-supervised learning for ASR
Florian L. Kreyssig, Yangyang Shi, Jinxi Guo, Leda Sari, Abdelrahman Mohamed, Philip C. Woodland

TL;DR
This paper biases self-supervised learning towards automatic speech recognition by slightly fine-tuning the model that generates the target sequence, which improves accuracy and speeds up training. It also adapts masked prediction pre-training (MPPT) for low-footprint streaming models by computing the loss on both masked and unmasked frames.
Contribution
It proposes a novel biasing method for self-supervised learning in ASR and a variant of MPPT suitable for low-footprint streaming models, enhancing accuracy and training speed.
Findings
Biased training improves WER on test-other over unbiased training by 15.5% after 250k updates and by 23.8% after 100k updates.
Streaming models see a 44.1% reduction in WER.
Proposed methods outperform unbiased training in ASR tasks.
Abstract
Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly finetune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames. These approaches are evaluated for automatic speech recognition on the Librispeech corpus, where 100 hours of data served as the labelled data and 860 hours as the unlabelled data. The biased training outperforms the unbiased training by 15.5% after 250k updates and 23.8% after 100k updates on test-other. For the streaming…
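The abstract says the streaming variant computes the MPPT loss on masked and unmasked frames. The following is a minimal sketch of that idea in plain Python, not the authors' implementation. The function name `mppt_loss`, the `unmasked_weight` parameter, and the weighted-mean reduction are assumptions made for illustration; the abstract does not state how the two groups of frames are weighted.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one frame's class logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def mppt_loss(frame_logits, targets, masked, unmasked_weight=0.0):
    """Weighted mean cross-entropy between per-frame predictions and targets.

    Hypothetical helper: masked frames get weight 1.0, unmasked frames get
    `unmasked_weight`. With unmasked_weight=0.0 only masked frames count;
    with a positive value the unmasked frames contribute as well.
    """
    total, weight_sum = 0.0, 0.0
    for logits, target, is_masked in zip(frame_logits, targets, masked):
        weight = 1.0 if is_masked else unmasked_weight
        if weight == 0.0:
            continue
        total += -weight * log_softmax(logits)[target]
        weight_sum += weight
    return total / weight_sum if weight_sum > 0 else 0.0


# Toy example: 4 frames, 3 target classes (all values are made up).
frame_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 0, 2]
masked = [True, False, True, False]

masked_only = mppt_loss(frame_logits, targets, masked, unmasked_weight=0.0)
all_frames = mppt_loss(frame_logits, targets, masked, unmasked_weight=1.0)
print(masked_only, all_frames)
```

Setting `unmasked_weight` to zero makes the loss depend on masked frames only, while any positive value lets every frame provide a training signal. How the paper balances the two groups is not given in the abstract, so the weight here is purely illustrative.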
Taxonomy
Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing
