Biased Self-supervised learning for ASR

Florian L. Kreyssig; Yangyang Shi; Jinxi Guo; Leda Sari; Abdelrahman Mohamed; Philip C. Woodland

arXiv:2211.02536·cs.CL·November 7, 2022

TL;DR

This paper biases self-supervised learning for automatic speech recognition towards the target task by slightly fine-tuning the model that produces the target sequence, which improves both performance and training speed. It also adapts masked prediction pre-training (MPPT) for low-footprint streaming models by computing the loss on both masked and unmasked frames.

Contribution

It proposes a novel biasing method for self-supervised learning in ASR and a variant of MPPT suitable for low-footprint streaming models, enhancing accuracy and training speed.

Findings

01

Biased training outperforms unbiased training on test-other by 15.5% after 250k updates and by 23.8% after 100k updates.

02

Streaming models see a 44.1% reduction in WER.

03

Proposed methods outperform unbiased training in ASR tasks.

Abstract

Self-supervised learning via masked prediction pre-training (MPPT) has shown impressive performance on a range of speech-processing tasks. This paper proposes a method to bias self-supervised learning towards a specific task. The core idea is to slightly fine-tune the model that is used to obtain the target sequence. This leads to better performance and a substantial increase in training speed. Furthermore, this paper proposes a variant of MPPT that allows low-footprint streaming models to be trained effectively by computing the MPPT loss on masked and unmasked frames. These approaches are evaluated for automatic speech recognition on the Librispeech corpus, where 100 hours of data served as the labelled data and 860 hours as the unlabelled data. The biased training outperforms the unbiased training by 15.5% after 250k updates and 23.8% after 100k updates on test-other. For the streaming…
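
The abstract does not spell out the loss formula, so the following is only a minimal sketch of what "computing the MPPT loss on masked and unmasked frames" could look like. It assumes a per-frame cross-entropy against discrete target units; the function names and the `unmasked_weight` parameter are hypothetical and are not taken from the paper.

```python
import math


def frame_cross_entropy(logits, target):
    """Return -log softmax(logits)[target] for a single frame."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]


def mean(values):
    """Average of a list, or 0.0 when the list is empty."""
    return sum(values) / len(values) if values else 0.0


def mppt_loss(logits, targets, mask, unmasked_weight=1.0):
    """Masked-prediction loss over masked and (optionally) unmasked frames.

    logits:  per-frame scores over the discrete target units
    targets: per-frame target unit indices (from the target-sequence model)
    mask:    per-frame booleans, True where the input frame was masked
    unmasked_weight: hypothetical weight; 0.0 gives a masked-frames-only loss
    """
    masked, unmasked = [], []
    for frame_logits, target, is_masked in zip(logits, targets, mask):
        loss = frame_cross_entropy(frame_logits, target)
        (masked if is_masked else unmasked).append(loss)
    return mean(masked) + unmasked_weight * mean(unmasked)


# Toy example: 3 frames, 2 target units, uniform scores on every frame.
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
targets = [0, 1, 0]
mask = [True, False, True]

masked_only = mppt_loss(logits, targets, mask, unmasked_weight=0.0)
both = mppt_loss(logits, targets, mask, unmasked_weight=1.0)
print(masked_only, both)
```

In this toy case every frame contributes a cross-entropy of log 2, so including the unmasked frames with weight 1.0 doubles the masked-only value. How the paper actually weights the two terms is not stated in the text above.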

Taxonomy

Topics: Speech Recognition and Synthesis · Speech and Audio Processing · Music and Audio Processing