Effectiveness of self-supervised pre-training for speech recognition

Alexei Baevski; Michael Auli; Abdelrahman Mohamed

arXiv:1911.03912 · cs.CL · May 20, 2020 · 100 cites

TL;DR

This paper evaluates self-supervised pre-training methods for speech recognition, showing that quantization-based algorithms like vq-wav2vec improve accuracy and enable effective models with minimal labeled data.

Contribution

It introduces a direct fine-tuning approach of BERT models on transcribed speech and compares raw audio versus spectral features for pre-training.

Findings

01 vq-wav2vec improves accuracy over non-quantized methods

02 Near-zero transcribed data can achieve competitive speech recognition performance

03 Significant WER reduction with minimal labeled data
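
The findings above are stated in terms of WER (word error rate). As a reminder of what that metric measures, here is a minimal sketch using the standard definition: word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are illustrative assumptions, not the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat", "the bat sat")` counts one substitution over three reference words.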

Abstract

We compare self-supervised representation learning algorithms which either explicitly quantize the audio data or learn representations without quantization. We find the former to be more accurate since it builds a good vocabulary of the data through vq-wav2vec [1] to enable learning of effective representations in subsequent BERT training. Unlike previous work, we directly fine-tune the pre-trained BERT models on transcribed speech using a Connectionist Temporal Classification (CTC) loss instead of feeding the representations into a task-specific model. We also propose a BERT-style model learning directly from the continuous audio data and compare pre-training on raw audio to spectral features. Fine-tuning a BERT model on 10 hours of labeled Librispeech data with a vq-wav2vec vocabulary is almost as good as the best known reported system trained on 100 hours of labeled data on…
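
The abstract mentions fine-tuning with a CTC loss. A model trained this way emits one label per audio frame, including a special blank label, and the frame sequence is mapped to an output sequence by merging consecutive repeats and then dropping blanks. The sketch below shows only that standard collapse step on greedy (per-frame argmax) labels. The name `ctc_greedy_collapse` and the choice of index 0 for the blank are assumptions for illustration, not details taken from the paper.

```python
from itertools import groupby
from typing import Iterable, List

BLANK = 0  # assumption: label index reserved for the CTC blank symbol


def ctc_greedy_collapse(frame_ids: Iterable[int], blank: int = BLANK) -> List[int]:
    """Merge consecutive duplicate labels, then remove blank labels."""
    # groupby yields one key per run of identical consecutive labels,
    # so repeated frames of the same label collapse to a single entry.
    return [label for label, _ in groupby(frame_ids) if label != blank]
```

A blank between two identical labels keeps them distinct, so frames `[3, 3, 0, 3]` collapse to two occurrences of label 3 rather than one.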

Taxonomy

Topics: Speech Recognition and Synthesis · Music and Audio Processing · Speech and Audio Processing

Methods: Linear Layer · Residual Connection · Attention Dropout · Linear Warmup With Linear Decay · Weight Decay · Dense Connections · Adam · WordPiece · Softmax