Optimized Self-supervised Training with BEST-RQ for Speech Recognition

Ilja Baumann; Dominik Wagner; Korbinian Riedhammer; Tobias Bocklet

arXiv:2501.16131 · cs.SD · January 28, 2025

Open Access

TL;DR

This paper improves BEST-RQ self-supervised pre-training for speech recognition by adding Kullback-Leibler divergence regularization and a multi-codebook extension, yielding relative WER improvements of up to 23.8% on LibriSpeech test-clean and 30.6% on test-other along with faster, more stable training.

Contribution

It optimizes BEST-RQ with two changes: Kullback-Leibler divergence as an additional regularizing loss, and a multi-codebook extension per cluster derived from low-level feature clustering. Both reduce word error rate relative to the BEST-RQ baseline.

Findings

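01

11.2% relative WER reduction on LibriSpeech test-clean from using multiple codebooks (preliminary experiments on the train-100 split)

02

Further 4.5% WER reduction from combining the cross-entropy and Kullback-Leibler divergence losses

03

Up to 30.6% relative WER improvement on test-other (and up to 23.8% on test-clean) with full LibriSpeech pre-training and fine-tuning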
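
For readers unfamiliar with the metric, a relative WER reduction compares the improvement to the baseline error rate rather than reporting an absolute difference. The sketch below shows the arithmetic only; the function name `relative_wer_reduction` and the example WER values are hypothetical and are not taken from the paper.

```python
def relative_wer_reduction(baseline_wer: float, new_wer: float) -> float:
    """Return the relative WER reduction in percent.

    Both arguments are word error rates in the same unit (e.g. percent).
    """
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


# Hypothetical numbers, chosen only to illustrate the formula:
# a baseline WER of 10.0% improved to 8.88% is an 11.2% relative reduction.
print(round(relative_wer_reduction(10.0, 8.88), 1))  # 11.2
```

Note that the same absolute gain yields a larger relative reduction when the baseline WER is lower, which is why relative figures on test-clean and test-other are not directly comparable.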

Abstract

Self-supervised learning has been successfully used for various speech-related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence as an additional regularizing loss and a multi-codebook extension per cluster derived from low-level feature clustering. Preliminary experiments on the train-100 split of LibriSpeech result in a relative improvement of 11.2% on test-clean by using multiple codebooks; combining cross-entropy and Kullback-Leibler divergence further reduces the word error rate by 4.5%. The proposed optimizations on full LibriSpeech pre-training and fine-tuning result in relative word error rate improvements of up to 23.8% on test-clean and 30.6% on…
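
To make the two ingredients named in the abstract more concrete, here is a minimal NumPy sketch of a random-projection quantizer with several codebooks and a loss that adds a Kullback-Leibler term to cross-entropy. It is an illustration under stated assumptions, not the authors' implementation. The frozen random projection and frozen random codebooks follow the general BEST-RQ idea. Everything else is assumed for illustration: the function names, the dimensions, the use of a softmax over codebook similarities as the soft target for the KL term, the `temperature` and `kl_weight` parameters, and the omission of the per-cluster codebook assignment, whose details the abstract does not give.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim, vocab_size, num_codebooks, seed=0):
    """Create a frozen random projection and frozen random codebooks."""
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((feat_dim, code_dim))
    codebooks = rng.standard_normal((num_codebooks, vocab_size, code_dim))
    # Unit-normalize code vectors so nearest neighbour == highest cosine similarity.
    codebooks /= np.linalg.norm(codebooks, axis=-1, keepdims=True)
    return projection, codebooks


def quantize(features, projection, codebooks):
    """Map frames (T, feat_dim) to one target index per codebook, shape (N, T)."""
    projected = features @ projection
    projected /= np.linalg.norm(projected, axis=-1, keepdims=True)
    # similarity[n, t, v] = cosine similarity of frame t to code v of codebook n
    similarity = np.einsum("td,nvd->ntv", projected, codebooks)
    return similarity.argmax(axis=-1), similarity


def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))


def combined_loss(logits, targets, soft_targets, kl_weight=1.0):
    """Cross-entropy on hard targets plus a weighted KL(soft_targets || model).

    logits: (N, T, V) model outputs, one prediction head per codebook.
    targets: (N, T) integer code indices from the quantizer.
    soft_targets: (N, T, V) probabilities (ASSUMED form of the KL target).
    """
    log_probs = log_softmax(logits)
    ce = -np.take_along_axis(log_probs, targets[..., None], axis=-1).mean()
    kl = (soft_targets * (np.log(soft_targets + 1e-12) - log_probs)).sum(axis=-1).mean()
    return ce + kl_weight * kl, ce, kl


# Usage with random stand-ins for speech features and model outputs.
projection, codebooks = make_quantizer(feat_dim=80, code_dim=16, vocab_size=32, num_codebooks=4)
features = np.random.default_rng(1).standard_normal((10, 80))  # 10 frames
targets, similarity = quantize(features, projection, codebooks)
temperature = 0.1  # hypothetical value
soft_targets = np.exp(log_softmax(similarity / temperature))
logits = np.random.default_rng(2).standard_normal((4, 10, 32))
total, ce, kl = combined_loss(logits, targets, soft_targets, kl_weight=0.5)
```

In an actual pre-training setup the logits would come from an encoder applied to masked input, and the loss would be computed only on masked frames; both steps are left out here to keep the sketch short.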

Taxonomy

Topics: Speech Recognition and Synthesis