Unsupervised Speech Recognition

Alexei Baevski; Wei-Ning Hsu; Alexis Conneau; Michael Auli

arXiv:2105.11084 · cs.CL · May 4, 2022 · 21 cites

TL;DR

This paper introduces wav2vec-U, an unsupervised speech recognition method that trains without labeled data by combining self-supervised speech representations with adversarial training, and reports lower error rates than the best previous unsupervised work.

Contribution

The paper presents wav2vec-U, a novel unsupervised speech recognition approach that eliminates the need for labeled data by leveraging self-supervised features and adversarial learning.

Findings

01 Reduces phoneme error rate on TIMIT from 26.1 to 11.3.

02 Achieves 5.9% word error rate on Librispeech test-other.

03 Successfully applied to nine diverse languages, including low-resource ones.
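The phoneme and word error rates quoted above are conventionally computed as an edit distance between the reference and the recognized token sequence, divided by the reference length. The sketch below illustrates that standard definition only; it is not the paper's scoring code, and the function names (`edit_distance`, `error_rate`, `relative_reduction`) are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Error rate in percent: edits needed divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def relative_reduction(before, after):
    """Relative improvement in percent between two error rates."""
    return 100.0 * (before - after) / before


# Toy phoneme sequences (made up for illustration, not from any benchmark).
print(error_rate(["k", "ae", "t"], ["k", "ae", "p"]))
# Relative reduction implied by the TIMIT figures quoted above.
print(round(relative_reduction(26.1, 11.3), 1))
```

Applied to the TIMIT figures above, going from 26.1 to 11.3 corresponds to a relative reduction of about 56.7 percent.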

Abstract

Despite rapid progress in the recent past, current speech recognition systems still require labeled training data which limits this technology to a small fraction of the languages spoken around the globe. This paper describes wav2vec-U, short for wav2vec Unsupervised, a method to train speech recognition models without any labeled data. We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training. The right representations are key to the success of our method. Compared to the best previous unsupervised work, wav2vec-U reduces the phoneme error rate on the TIMIT benchmark from 26.1 to 11.3. On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two…
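The abstract says that self-supervised representations are used to segment the unlabeled audio, and this page lists k-means clustering among the methods. One way to read that combination is sketched below, under two assumptions that this page does not spell out: each frame already carries a k-means cluster ID, and a segment boundary is placed wherever that ID changes, with frames inside a segment mean-pooled. The helper names `segment_by_cluster` and `mean_pool` are hypothetical, and the adversarial training step is not shown.

```python
from itertools import groupby


def segment_by_cluster(cluster_ids):
    """Return (start, end) frame spans, splitting wherever the cluster ID changes."""
    spans = []
    start = 0
    for _, run in groupby(cluster_ids):
        length = len(list(run))
        spans.append((start, start + length))  # end index is exclusive
        start += length
    return spans


def mean_pool(features, spans):
    """Average the frame vectors inside each span into one segment vector."""
    pooled = []
    for start, end in spans:
        frames = features[start:end]
        dim = len(frames[0])
        pooled.append([sum(f[d] for f in frames) / len(frames) for d in range(dim)])
    return pooled


# Toy input: six frames with 2-dimensional features and made-up cluster IDs.
cluster_ids = [3, 3, 7, 7, 7, 1]
features = [[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [0.0, 3.0], [3.0, 3.0], [5.0, 5.0]]

spans = segment_by_cluster(cluster_ids)
segments = mean_pool(features, spans)
print(spans)
print(segments)
```

In this reading, the pooled segment vectors would be the inputs from which a mapping to phonemes is learned.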

Code & Models

Videos

Unsupervised Speech Recognition · SlidesLive

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: k-Means Clustering · wav2vec Unsupervised