# wav2vec: Unsupervised Pre-training for Speech Recognition

**Authors:** Steffen Schneider, Alexei Baevski, Ronan Collobert, Michael Auli

arXiv: 1904.05862 · 2019-09-12

## TL;DR

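This paper introduces wav2vec, an unsupervised pre-training method that learns representations from raw, unlabeled audio and uses them to improve acoustic model training. On WSJ it reduces the WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data are available, and it outperforms Deep Speech 2 while using two orders of magnitude less labeled training data.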

## Contribution

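wav2vec pre-trains a simple multi-layer convolutional neural network on raw, unlabeled audio with a noise contrastive binary classification task. The learned representations are then used as input for acoustic model training, which improves speech recognition when little transcribed data is available.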
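To make the contrastive task concrete, the following is a minimal sketch of a binary classification loss that scores one true sample against sampled distractors. It is an illustration under stated assumptions, not the authors' implementation: the names `contrastive_loss`, `context`, `target`, and `distractors` are hypothetical, the vectors are toy values, and summing the distractor terms without a weighting factor is a simplification chosen here.

```python
import math
import random


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(context, target, distractors) -> float:
    """Binary classification loss (hypothetical helper).

    The true sample should score high against the context vector,
    and every distractor should score low.
    """
    positive = log_sigmoid(dot(context, target))
    negatives = sum(log_sigmoid(-dot(context, d)) for d in distractors)
    return -(positive + negatives)


# Toy data (assumption): a 4-dimensional context vector and 5 random distractors.
random.seed(0)
context = [1.0, 0.5, -0.5, 2.0]
distractors = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]

aligned_target = list(context)             # agrees with the context
opposed_target = [-x for x in context]     # disagrees with the context

loss_aligned = contrastive_loss(context, aligned_target, distractors)
loss_opposed = contrastive_loss(context, opposed_target, distractors)
print(f"aligned: {loss_aligned:.3f}  opposed: {loss_opposed:.3f}")
```

In this sketch a target that agrees with the context yields a lower loss than one that opposes it, which is the signal a model trained on such a task would follow to tell true samples from distractors.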

## Key findings

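- Reduces the WER of a strong character-based log-mel filterbank baseline on WSJ by up to 36% when only a few hours of transcribed data are available
- Achieves 2.43% WER on the nov92 test set
- Outperforms Deep Speech 2, the best reported character-based system in the literature, while using two orders of magnitude less labeled training data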
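
The findings above are stated in terms of word error rate (WER). As a minimal sketch of how such figures are typically computed, the block below implements WER as word-level edit distance divided by reference length, plus a relative reduction helper. The function names and the example sentences are hypothetical, reading the 36% figure as a relative reduction is an assumption, and none of the numbers in the usage lines come from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, new_wer: float) -> float:
    """Fraction by which new_wer improves on baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - new_wer) / baseline_wer


# Illustrative values only (assumptions, not results from the paper).
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_reduction = relative_reduction(10.0, 6.4)
print(f"WER: {example_wer:.3f}, relative reduction: {example_reduction:.2%}")
```

Under this reading, a drop from a hypothetical baseline of 10.0% WER to 6.4% WER would count as a 36% relative reduction, even though the absolute difference is 3.6 points.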

## Abstract

We explore unsupervised pre-training for speech recognition by learning representations of raw audio. wav2vec is trained on large amounts of unlabeled audio data and the resulting representations are then used to improve acoustic model training. We pre-train a simple multi-layer convolutional neural network optimized via a noise contrastive binary classification task. Our experiments on WSJ reduce WER of a strong character-based log-mel filterbank baseline by up to 36% when only a few hours of transcribed data is available. Our approach achieves 2.43% WER on the nov92 test set. This outperforms Deep Speech 2, the best reported character-based system in the literature while using two orders of magnitude less labeled training data.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1904.05862/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/1904.05862/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/1904.05862/full.md

---
Source: https://tomesphere.com/paper/1904.05862