Unsupervised Speech Recognition
Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, Michael Auli

TL;DR
This paper introduces wav2vec-U, an unsupervised speech recognition method that learns from unlabeled audio using self-supervised representations and adversarial training, significantly reducing error rates across multiple languages.
Contribution
The paper presents wav2vec-U, a novel unsupervised speech recognition approach that eliminates the need for labeled data by leveraging self-supervised features and adversarial learning.
Findings
Reduces the phoneme error rate on TIMIT from 26.1 to 11.3 compared to the best previous unsupervised work.
Achieves 5.9% word error rate on Librispeech test-other.
Successfully applied to nine diverse languages, including low-resource ones.
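For readers unfamiliar with the metrics above, the following is a minimal sketch of how a phoneme or word error rate is conventionally computed: the edit distance between the reference and the hypothesis, divided by the reference length. The function names (`edit_distance`, `error_rate`, `relative_reduction`) are illustrative and not taken from the paper or its code release.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (phonemes or words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative reduction in percent between two error rates."""
    return 100.0 * (before - after) / before
```

Under this definition, going from 26.1 to 11.3 on TIMIT is a relative reduction of roughly 57%, which is what `relative_reduction(26.1, 11.3)` evaluates to.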
Abstract
Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two…
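The abstract says that self-supervised representations are used to segment the audio, and the taxonomy for this paper lists k-means clustering among its methods. One plausible way to picture that step is sketched below, under stated assumptions: each frame already has a cluster ID (for example from k-means over frame-level features), a segment boundary is placed wherever the ID changes, and the frames inside each segment are averaged. The function `pool_segments` and the boundary rule are illustrative assumptions, not the authors' implementation.

```python
from itertools import groupby


def pool_segments(features, cluster_ids):
    """Merge consecutive frames that share a cluster ID and mean-pool them.

    features:    list of equal-length float vectors, one per frame
    cluster_ids: list of per-frame cluster IDs (assumed precomputed)
    Returns (pooled_vectors, segment_ids), one entry per segment.
    """
    if len(features) != len(cluster_ids):
        raise ValueError("features and cluster_ids must have the same length")

    pooled, segment_ids = [], []
    start = 0
    for cid, run in groupby(cluster_ids):
        length = sum(1 for _ in run)             # frames in this run
        frames = features[start:start + length]
        dim = len(frames[0])
        # Average each feature dimension over the frames of the segment.
        pooled.append([sum(f[d] for f in frames) / length for d in range(dim)])
        segment_ids.append(cid)
        start += length
    return pooled, segment_ids
```

For instance, five frames with cluster IDs `[3, 3, 1, 1, 1]` collapse into two segments, giving a shorter sequence that a later model could map to phonemes.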
Taxonomy
Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing
Methods: k-Means Clustering · wav2vec Unsupervised
