On Scaling Contrastive Representations for Low-Resource Speech Recognition
Lasse Borgholt, Tycho Max Sylvester Tax, Jakob Drachmann Havtorn, Lars Maaløe, Christian Igel

TL;DR
This paper investigates the use of fixed contrastive speech representations for low-resource speech recognition, revealing limitations without fine-tuning and proposing a bidirectional extension to improve performance.
Contribution
It demonstrates the challenges of using fixed wav2vec 2.0 representations without fine-tuning and introduces a bidirectional extension that enhances recognition accuracy.
Findings
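Performance decreases without fine-tuning; in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.
Wav2vec 2.0 representations live in a low-dimensional subspace, and decorrelating their features can stabilize training of the speech recognizer.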
Bidirectional extension improves speech recognition performance.
Abstract
Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low-dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose…
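
To make the decorrelation finding concrete, here is a minimal sketch of one common way to decorrelate fixed feature vectors: PCA whitening. The abstract does not state which decorrelation method the authors use, so PCA whitening is an assumption here, as are the helper name `pca_whiten`, the `var_threshold` cutoff, and the synthetic `toy_features` matrix, which stands in for real wav2vec 2.0 outputs. The toy data is deliberately built to lie in a 4-dimensional subspace of a 16-dimensional space, so the sketch also shows how a low effective dimensionality can be read off the covariance spectrum.

```python
import numpy as np


def pca_whiten(features, eps=1e-8, var_threshold=1e-6):
    """Decorrelate the columns of `features` (frames x dims) via PCA whitening.

    Hypothetical helper for illustration only. Returns the whitened features
    and the number of principal directions whose variance exceeds
    `var_threshold` times the largest variance.
    """
    # Center each feature dimension.
    centered = features - features.mean(axis=0, keepdims=True)
    # Sample covariance matrix (dims x dims).
    cov = centered.T @ centered / (centered.shape[0] - 1)
    # eigh returns eigenvalues in ascending order; sort them descending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Keep only directions that carry non-negligible variance.
    keep = eigvals > var_threshold * eigvals[0]
    effective_dim = int(keep.sum())
    # Rotate onto the kept principal axes, then rescale to unit variance.
    projected = centered @ eigvecs[:, keep]
    whitened = projected / np.sqrt(eigvals[keep] + eps)
    return whitened, effective_dim


# Toy stand-in for fixed representations: 2000 frames, 16 dimensions,
# generated from only 4 latent factors (an assumption for this example).
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 4))
mixing = rng.normal(size=(4, 16))
toy_features = latent @ mixing

whitened, effective_dim = pca_whiten(toy_features)
print("effective dimensionality:", effective_dim)
print("whitened shape:", whitened.shape)
```

After whitening, the retained dimensions are uncorrelated with unit variance, which is the property a downstream recognizer would see in place of the raw, strongly correlated features. Whether this matches the paper's exact preprocessing is not confirmed by the abstract.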
