On Scaling Contrastive Representations for Low-Resource Speech Recognition

Lasse Borgholt; Tycho Max Sylvester Tax; Jakob Drachmann Havtorn; Lars Maaløe; Christian Igel

arXiv:2102.00850·eess.AS·February 2, 2021

TL;DR

This paper investigates the use of fixed contrastive speech representations for low-resource speech recognition, revealing limitations without fine-tuning and proposing a bidirectional extension to improve performance.

Contribution

The paper shows that recognition performance degrades when fixed wav2vec 2.0 representations are used without fine-tuning, and it introduces a bidirectional extension that improves recognition accuracy.

Findings

1. Performance drops without fine-tuning; in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor.

2. Wav2vec 2.0 representations live in a low-dimensional subspace, and decorrelating their features can stabilize training of the speech recognizer.

3. A bidirectional extension improves speech recognition performance.

Abstract

Recent advances in self-supervised learning through contrastive training have shown that it is possible to learn a competitive speech recognition system with as little as 10 minutes of labeled data. However, these systems are computationally expensive since they require pre-training followed by fine-tuning in a large parameter space. We explore the performance of such systems without fine-tuning by training a state-of-the-art speech recognizer on the fixed representations from the computationally demanding wav2vec 2.0 framework. We find performance to decrease without fine-tuning and, in the extreme low-resource setting, wav2vec 2.0 is inferior to its predecessor. In addition, we find that wav2vec 2.0 representations live in a low dimensional subspace and that decorrelating the features of the representations can stabilize training of the automatic speech recognizer. Finally, we propose…
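
The abstract's observation about a low-dimensional subspace and feature decorrelation can be illustrated with a small sketch. The abstract does not say which decorrelation method the authors use, so PCA whitening is assumed here purely for illustration. The synthetic `features` array stands in for real wav2vec 2.0 outputs, and the names `pca_whiten` and `effective_dim` are hypothetical helpers, not part of any library or of the paper's code.

```python
import numpy as np


def pca_whiten(features, eps=1e-8):
    """Decorrelate features with PCA whitening (an assumed method).

    features: array of shape (num_frames, dim).
    Returns (whitened, explained_variance_ratio), with components
    sorted from largest to smallest variance.
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (centered.shape[0] - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]       # reorder to descending
    eigvals = np.clip(eigvals[order], 0.0, None)
    eigvecs = eigvecs[:, order]
    # Project onto the principal axes and rescale each to unit variance.
    whitened = centered @ eigvecs / np.sqrt(eigvals + eps)
    return whitened, eigvals / eigvals.sum()


def effective_dim(ratio, threshold=0.99):
    """Smallest number of components whose cumulative variance reaches threshold."""
    return int(np.searchsorted(np.cumsum(ratio), threshold) + 1)


# Synthetic stand-in: 64-dimensional features generated from 8 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 8))
mixing = rng.normal(size=(8, 64))
features = latent @ mixing + 0.01 * rng.normal(size=(2000, 64))

whitened, ratio = pca_whiten(features)
k = effective_dim(ratio, threshold=0.999)
reduced = whitened[:, :k]  # keep only the high-variance, decorrelated components
print("effective dimension:", k)
print("reduced shape:", reduced.shape)
```

Truncating to the leading components before feeding a downstream recognizer avoids amplifying the near-zero-variance directions, which is one plausible reason to combine dimensionality reduction with whitening. Whether this matches the paper's setup is not stated in the text above.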
