Optimized Self-supervised Training with BEST-RQ for Speech Recognition
Ilja Baumann, Dominik Wagner, Korbinian Riedhammer, Tobias Bocklet

TL;DR
This paper extends the BEST-RQ self-supervised speech pre-training method with Kullback-Leibler (KL) divergence regularization and a multi-codebook extension, reporting relative WER reductions of up to 23.8% on LibriSpeech test-clean and 30.6% on test-other, along with faster, more stable training.
Contribution
It introduces two optimizations to BEST-RQ, a KL divergence regularization loss and a multi-codebook extension per cluster derived from low-level feature clustering, and reports relative WER improvements on LibriSpeech.
Findings
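11.2% relative WER improvement on LibriSpeech test-clean from multiple codebooks (preliminary train-100 experiments)
4.5% further WER reduction from combining cross-entropy and KL divergence losses
up to 23.8% (test-clean) and 30.6% (test-other) relative WER improvement with full LibriSpeech pre-training and fine-tuning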
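The figures above are relative, not absolute, changes in word error rate. As a minimal sketch of that arithmetic, the function below computes a relative reduction from a baseline and an improved WER. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    This is the share of the baseline error that was removed,
    not the absolute difference between the two error rates.
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical example values (not results from the paper):
# a drop from 10.0% to 8.0% WER is 2 points absolute, 20% relative.
example = relative_wer_reduction(10.0, 8.0)
```

Under this definition, the same absolute gain counts for more when the baseline WER is already low, so relative figures are best compared on the same test set.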
Abstract
Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech result in a relative improvement of 11.2% on test-clean by using multiple codebooks; utilizing a combination of cross-entropy and Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on…
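The two ingredients named in the abstract can be illustrated with a rough sketch in plain Python. It assumes the usual BEST-RQ setup of a frozen random projection and frozen random codebooks with nearest-neighbour lookup on normalized vectors, and it simply uses several independent codebooks. It does not model the per-cluster codebook assignment the abstract mentions. The abstract also does not say which distributions the KL term compares, so `reference_probs` and `kl_weight` are hypothetical placeholders, as are all function names.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def make_quantizer(feat_dim, code_dim, codebook_size, num_codebooks, seed=0):
    """Build a frozen random projection and several frozen random codebooks.

    Assumption: Gaussian initialization for both; nothing here is trained.
    """
    rng = random.Random(seed)
    projection = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(code_dim)]
    codebooks = [
        [l2_normalize([rng.gauss(0, 1) for _ in range(code_dim)])
         for _ in range(codebook_size)]
        for _ in range(num_codebooks)
    ]
    return projection, codebooks


def quantize(frame, projection, codebooks):
    """Return one target index per codebook: the code nearest to the projected frame."""
    projected = l2_normalize(
        [sum(w * x for w, x in zip(row, frame)) for row in projection]
    )
    targets = []
    for codebook in codebooks:
        dists = [sum((p - c) ** 2 for p, c in zip(projected, code)) for code in codebook]
        targets.append(dists.index(min(dists)))
    return targets


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_loss(logits, target_index, reference_probs, kl_weight=0.5):
    """Cross-entropy on the hard target plus weighted KL(reference || prediction).

    Assumption: reference_probs is a hypothetical soft distribution; the paper's
    actual KL term may be defined differently.
    """
    probs = softmax(logits)
    ce = -math.log(probs[target_index])
    kl = sum(r * math.log(r / p) for r, p in zip(reference_probs, probs) if r > 0)
    return ce + kl_weight * kl


# Usage with made-up inputs: two codebooks of eight codes each.
projection, codebooks = make_quantizer(feat_dim=4, code_dim=3, codebook_size=8, num_codebooks=2)
targets = quantize([0.1, -0.3, 0.7, 0.2], projection, codebooks)
loss = combined_loss([0.0] * 8, targets[0], [1 / 8] * 8)
```

In a multi-codebook setup, each codebook would typically get its own prediction head and the per-codebook losses would be summed or averaged. That aggregation is a design choice left out of this sketch.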
Taxonomy
Topics: Speech Recognition and Synthesis
